Even though many studies have examined the different factors affecting life expectancy, none before have taken into account immunization effects and the Human Development Index (HDI). Also, most of the research has not considered both different countries and different years at once, as this one does. Life expectancy is nevertheless one of the most important indicators a country has, and therefore one that governments try to improve over time through correct policies. We therefore have to ask ourselves which variables are the most important ones (and thus which directions public policy should take) for predicting or understanding the life expectancy of a country.
The importance of these variables is clear, which motivates studying their relationship with others related to economic, social, health, immunization and mortality factors. At the same time, it is important to know which variables are not so important in explaining a country's life expectancy; this way, countries will also know where not to put their resources.
This dataset contains information on 193 countries, collected from the United Nations website. Among all categories of health-related factors, only the most representative critical factors were chosen. It has been observed that over the past 15 years there has been huge development in the health sector, improving human mortality rates, especially in developing nations, compared with the previous 30 years. This is also why the data were collected for the years 2000 to 2015.
In this project I want to find the best model to predict life expectancy in a country. The goal is actually twofold: I want to predict this value, but I am also interested in finding the most important variables for that prediction. A very complex model may yield good predictive results yet be very difficult to interpret; this is why I will sometimes also take into account the parsimony of the model, and not only its error measures.
To assess how good a model is, I will take into account the following: its predictions, its prediction intervals, the Mean Absolute Error (MAE), the R-squared or adjusted R-squared (depending on the case, sometimes one and sometimes both) to assess how much variability the model explains, and how parsimonious the model is.
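For concreteness, these measures can be computed by hand; a minimal sketch with made-up numbers (the helper names are my own, not from any package):

```r
# Hand-rolled versions of the evaluation measures (illustration only)
mae <- function(actual, predicted) mean(abs(actual - predicted))
r_squared <- function(actual, predicted) cor(actual, predicted)^2
# Adjusted R-squared penalizes the number of predictors p for sample size n
adj_r_squared <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

actual    <- c(70, 75, 68, 80)
predicted <- c(71, 74, 69, 78)
mae(actual, predicted)  # 1.25
```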
In this project, I will first do data engineering and pre-processing. Then I will try to find the best model, using the most important variables to get better performance. Next, I will combine the best models to try to obtain an even better one. Finally, I will build prediction intervals for the best model and for the best combination of models. I did not build prediction intervals for every model, as most of them are used not for predicting but for choosing variables and interactions.
### Defining Colours
color_1 <- "blue"
color_2 <- "red"
color_3 <- "yellow"
library("mice")
##
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
##
## cbind, rbind
library("tidyverse")
## ── Attaching packages ───────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library("ggplot2")
library("ggcorrplot")
library("caret")
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library("corrplot")
## corrplot 0.84 loaded
library("FOCI")
## Loading required package: data.table
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
library("Boruta")
library("GGally")
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library("MASS")
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library("olsrr")
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:MASS':
##
## cement
## The following object is masked from 'package:datasets':
##
## rivers
library("psych")
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library("car")
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
set.seed(45052185)
data <- read.csv("life_expect.csv")
dim(data)
## [1] 2938 22
head(data)
## Country Year Status Life.expectancy Adult.Mortality infant.deaths
## 1 Afghanistan 2015 Developing 65.0 263 62
## 2 Afghanistan 2014 Developing 59.9 271 64
## 3 Afghanistan 2013 Developing 59.9 268 66
## 4 Afghanistan 2012 Developing 59.5 272 69
## 5 Afghanistan 2011 Developing 59.2 275 71
## 6 Afghanistan 2010 Developing 58.8 279 74
## Alcohol percentage.expenditure Hepatitis.B Measles BMI under.five.deaths
## 1 0.01 71.279624 65 1154 19.1 83
## 2 0.01 73.523582 62 492 18.6 86
## 3 0.01 73.219243 64 430 18.1 89
## 4 0.01 78.184215 67 2787 17.6 93
## 5 0.01 7.097109 68 3013 17.2 97
## 6 0.01 79.679367 66 1989 16.7 102
## Polio Total.expenditure Diphtheria HIV.AIDS GDP Population
## 1 6 8.16 65 0.1 584.25921 33736494
## 2 58 8.18 62 0.1 612.69651 327582
## 3 62 8.13 64 0.1 631.74498 31731688
## 4 67 8.52 67 0.1 669.95900 3696958
## 5 68 7.87 68 0.1 63.53723 2978599
## 6 66 9.20 66 0.1 553.32894 2883167
## thinness..1.19.years thinness.5.9.years Income.composition.of.resources
## 1 17.2 17.3 0.479
## 2 17.5 17.5 0.476
## 3 17.7 17.7 0.470
## 4 17.9 18.0 0.463
## 5 18.2 18.2 0.454
## 6 18.4 18.4 0.448
## Schooling
## 1 10.1
## 2 10.0
## 3 9.9
## 4 9.8
## 5 9.5
## 6 9.2
As we can see, we have a total of 22 variables (columns: 21 independent features and 1 output) and 2938 rows. As my intention is to predict the life expectancy of a country given the values of some of its other characteristics, I will drop the variables holding the name of the country and the year the measures were taken, since they add no important information to this analysis (for other analyses they could be useful). This will be done soon, once we have fixed the missing values.
summary(data)
## Country Year Status Life.expectancy
## Length:2938 Min. :2000 Length:2938 Min. :36.30
## Class :character 1st Qu.:2004 Class :character 1st Qu.:63.10
## Mode :character Median :2008 Mode :character Median :72.10
## Mean :2008 Mean :69.22
## 3rd Qu.:2012 3rd Qu.:75.70
## Max. :2015 Max. :89.00
## NA's :10
## Adult.Mortality infant.deaths Alcohol percentage.expenditure
## Min. : 1.0 Min. : 0.0 Min. : 0.0100 Min. : 0.000
## 1st Qu.: 74.0 1st Qu.: 0.0 1st Qu.: 0.8775 1st Qu.: 4.685
## Median :144.0 Median : 3.0 Median : 3.7550 Median : 64.913
## Mean :164.8 Mean : 30.3 Mean : 4.6029 Mean : 738.251
## 3rd Qu.:228.0 3rd Qu.: 22.0 3rd Qu.: 7.7025 3rd Qu.: 441.534
## Max. :723.0 Max. :1800.0 Max. :17.8700 Max. :19479.912
## NA's :10 NA's :194
## Hepatitis.B Measles BMI under.five.deaths
## Min. : 1.00 Min. : 0.0 Min. : 1.00 Min. : 0.00
## 1st Qu.:77.00 1st Qu.: 0.0 1st Qu.:19.30 1st Qu.: 0.00
## Median :92.00 Median : 17.0 Median :43.50 Median : 4.00
## Mean :80.94 Mean : 2419.6 Mean :38.32 Mean : 42.04
## 3rd Qu.:97.00 3rd Qu.: 360.2 3rd Qu.:56.20 3rd Qu.: 28.00
## Max. :99.00 Max. :212183.0 Max. :87.30 Max. :2500.00
## NA's :553 NA's :34
## Polio Total.expenditure Diphtheria HIV.AIDS
## Min. : 3.00 Min. : 0.370 Min. : 2.00 Min. : 0.100
## 1st Qu.:78.00 1st Qu.: 4.260 1st Qu.:78.00 1st Qu.: 0.100
## Median :93.00 Median : 5.755 Median :93.00 Median : 0.100
## Mean :82.55 Mean : 5.938 Mean :82.32 Mean : 1.742
## 3rd Qu.:97.00 3rd Qu.: 7.492 3rd Qu.:97.00 3rd Qu.: 0.800
## Max. :99.00 Max. :17.600 Max. :99.00 Max. :50.600
## NA's :19 NA's :226 NA's :19
## GDP Population thinness..1.19.years
## Min. : 1.68 Min. :3.400e+01 Min. : 0.10
## 1st Qu.: 463.94 1st Qu.:1.958e+05 1st Qu.: 1.60
## Median : 1766.95 Median :1.387e+06 Median : 3.30
## Mean : 7483.16 Mean :1.275e+07 Mean : 4.84
## 3rd Qu.: 5910.81 3rd Qu.:7.420e+06 3rd Qu.: 7.20
## Max. :119172.74 Max. :1.294e+09 Max. :27.70
## NA's :448 NA's :652 NA's :34
## thinness.5.9.years Income.composition.of.resources Schooling
## Min. : 0.10 Min. :0.0000 Min. : 0.00
## 1st Qu.: 1.50 1st Qu.:0.4930 1st Qu.:10.10
## Median : 3.30 Median :0.6770 Median :12.30
## Mean : 4.87 Mean :0.6276 Mean :11.99
## 3rd Qu.: 7.20 3rd Qu.:0.7790 3rd Qu.:14.30
## Max. :28.60 Max. :0.9480 Max. :20.70
## NA's :34 NA's :167 NA's :163
As we can see, there are a lot of missing values in our dataset. Some variables, such as Population, have as many as 652 of them! That is plenty. Also, some variables have values equal to zero. As this dataset comes from the WHO, we will consider the zeros to be not mistakes but actual registered measures for specific countries.
data$Year <- as.factor(data$Year)
sum(apply(data, 1, anyNA))
## [1] 1289
As we can see, 1289 rows are affected by NAs. This is a huge problem, as it is almost 44% of our rows, meaning that it is not a good idea to drop so much information.
First, I will fill each country's missing values with the information from the previous (or next) year of the same country, so that each country's measures remain consistent. This is important, as any other imputation, with for instance the median or mean of the column, is not a good idea: some countries would get very illogical values. What is more, as the countries with many missing values are mostly very small ones (probably the smallest in the whole dataset), imputing any other way would surely give them much larger values than they should have.
Let's fill some missing values with values from earlier or later years of the same countries:
data <- data %>%
dplyr::group_by(Country) %>%
fill(Population,GDP,Life.expectancy,Schooling,Income.composition.of.resources,thinness..1.19.years,thinness.5.9.years,Total.expenditure,Hepatitis.B,BMI,Alcohol,Adult.Mortality, .direction = "downup") %>%
dplyr::ungroup()
Let's see how many missing values are still in the dataset.
sum(apply(data, 1, anyNA))
## [1] 818
We have fixed some missing values. Nevertheless, we still have 818 rows containing them. It is easy to tell that the biggest problems are with the variables Population and GDP. Let's see which countries have no information whatsoever regarding their population and drop them.
bygroup <- aggregate(Population ~ Country, data=data, function(x) {sum(is.na(x))}, na.action = NULL)
Now that I have grouped by country, let's see whether any country has no value at all in the Population column (in any year).
no_info <- bygroup$Country[bygroup$Population ==16]
no_info
## [1] "Antigua and Barbuda"
## [2] "Bahamas"
## [3] "Bahrain"
## [4] "Barbados"
## [5] "Bolivia (Plurinational State of)"
## [6] "Brunei Darussalam"
## [7] "Congo"
## [8] "Côte d'Ivoire"
## [9] "Cuba"
## [10] "Czechia"
## [11] "Democratic People's Republic of Korea"
## [12] "Democratic Republic of the Congo"
## [13] "Egypt"
## [14] "Gambia"
## [15] "Grenada"
## [16] "Iran (Islamic Republic of)"
## [17] "Kuwait"
## [18] "Kyrgyzstan"
## [19] "Lao People's Democratic Republic"
## [20] "Libya"
## [21] "Micronesia (Federated States of)"
## [22] "New Zealand"
## [23] "Oman"
## [24] "Qatar"
## [25] "Republic of Korea"
## [26] "Republic of Moldova"
## [27] "Saint Lucia"
## [28] "Saint Vincent and the Grenadines"
## [29] "Saudi Arabia"
## [30] "Singapore"
## [31] "Slovakia"
## [32] "Somalia"
## [33] "The former Yugoslav republic of Macedonia"
## [34] "United Arab Emirates"
## [35] "United Kingdom of Great Britain and Northern Ireland"
## [36] "United Republic of Tanzania"
## [37] "United States of America"
## [38] "Venezuela (Bolivarian Republic of)"
## [39] "Viet Nam"
## [40] "Yemen"
There are 40 countries with no value for population whatsoever. Taking a closer look, we can see that most of the missing values in our dataset come from the countries in this list, so I will drop them.
data <- filter(data, !Country %in% no_info)
summary(data)
## Country Year Status Life.expectancy
## Length:2298 2013 : 153 Length:2298 Min. :36.30
## Class :character 2000 : 143 Class :character 1st Qu.:62.20
## Mode :character 2001 : 143 Mode :character Median :71.40
## 2002 : 143 Mean :68.68
## 2003 : 143 3rd Qu.:75.40
## 2004 : 143 Max. :89.00
## (Other):1430 NA's :10
## Adult.Mortality infant.deaths Alcohol percentage.expenditure
## Min. : 1.0 Min. : 0.00 Min. : 0.010 Min. : 0.00
## 1st Qu.: 72.0 1st Qu.: 0.00 1st Qu.: 0.680 1st Qu.: 20.63
## Median :147.0 Median : 3.00 Median : 3.880 Median : 93.44
## Mean :170.3 Mean : 33.91 Mean : 4.578 Mean : 827.36
## 3rd Qu.:237.0 3rd Qu.: 23.00 3rd Qu.: 7.450 3rd Qu.: 490.23
## Max. :723.0 Max. :1800.00 Max. :17.870 Max. :19479.91
## NA's :10 NA's :17
## Hepatitis.B Measles BMI under.five.deaths
## Min. : 2.00 Min. : 0.0 Min. : 1.40 Min. : 0.00
## 1st Qu.:64.00 1st Qu.: 0.0 1st Qu.:18.70 1st Qu.: 1.00
## Median :86.00 Median : 17.0 Median :42.10 Median : 4.00
## Mean :73.83 Mean : 2535.6 Mean :37.53 Mean : 47.14
## 3rd Qu.:95.00 3rd Qu.: 433.2 3rd Qu.:55.80 3rd Qu.: 33.00
## Max. :99.00 Max. :212183.0 Max. :87.30 Max. :2500.00
## NA's :128 NA's :34
## Polio Total.expenditure Diphtheria HIV.AIDS
## Min. : 3.00 Min. : 0.370 Min. : 2.00 Min. : 0.100
## 1st Qu.:75.00 1st Qu.: 4.343 1st Qu.:77.00 1st Qu.: 0.100
## Median :92.00 Median : 5.855 Median :92.00 Median : 0.100
## Mean :81.21 Mean : 5.997 Mean :81.31 Mean : 2.046
## 3rd Qu.:96.00 3rd Qu.: 7.690 3rd Qu.:96.00 3rd Qu.: 1.100
## Max. :99.00 Max. :17.240 Max. :99.00 Max. :50.600
## NA's :19 NA's :19
## GDP Population thinness..1.19.years
## Min. : 1.68 Min. :3.400e+01 Min. : 0.100
## 1st Qu.: 431.93 1st Qu.:1.966e+05 1st Qu.: 1.500
## Median : 1488.47 Median :1.382e+06 Median : 2.900
## Mean : 6558.13 Mean :1.273e+07 Mean : 4.874
## 3rd Qu.: 4974.27 3rd Qu.:7.408e+06 3rd Qu.: 7.300
## Max. :119172.74 Max. :1.294e+09 Max. :27.700
## NA's :5 NA's :8 NA's :34
## thinness.5.9.years Income.composition.of.resources Schooling
## Min. : 0.100 Min. :0.0000 Min. : 0.00
## 1st Qu.: 1.500 1st Qu.:0.4795 1st Qu.: 9.80
## Median : 3.100 Median :0.6620 Median :12.10
## Mean : 4.933 Mean :0.6184 Mean :11.85
## 3rd Qu.: 7.400 3rd Qu.:0.7665 3rd Qu.:14.30
## Max. :28.600 Max. :0.9480 Max. :20.70
## NA's :34 NA's :7 NA's :3
dim(data)
## [1] 2298 22
sum(apply(data, 1, anyNA))
## [1] 178
As we can see, only 178 rows with missing values are left. We have lost some information by deleting so many rows, but we still have 2298 of them. Now we can do an imputation with the library “mice”.
data <- mice(data,m=1,method="cart")
##
## iter imp variable
## 1 1 Life.expectancy Adult.Mortality Alcohol Hepatitis.B BMI Polio Diphtheria GDP Population thinness..1.19.years thinness.5.9.years Income.composition.of.resources Schooling
## 2 1 Life.expectancy Adult.Mortality Alcohol Hepatitis.B BMI Polio Diphtheria GDP Population thinness..1.19.years thinness.5.9.years Income.composition.of.resources Schooling
## 3 1 Life.expectancy Adult.Mortality Alcohol Hepatitis.B BMI Polio Diphtheria GDP Population thinness..1.19.years thinness.5.9.years Income.composition.of.resources Schooling
## 4 1 Life.expectancy Adult.Mortality Alcohol Hepatitis.B BMI Polio Diphtheria GDP Population thinness..1.19.years thinness.5.9.years Income.composition.of.resources Schooling
## 5 1 Life.expectancy Adult.Mortality Alcohol Hepatitis.B BMI Polio Diphtheria GDP Population thinness..1.19.years thinness.5.9.years Income.composition.of.resources Schooling
## Warning: Number of logged events: 2
data <- complete(data)
attach(data)
Now that we don't have any more missing values, we can keep on working with our dataset.
As explained before, let's first drop the first and second columns of the data frame.
dato <- data[,3:22]
dim(dato)
## [1] 2298 20
head(dato)
## Status Life.expectancy Adult.Mortality infant.deaths Alcohol
## 1 Developing 65.0 263 62 0.01
## 2 Developing 59.9 271 64 0.01
## 3 Developing 59.9 268 66 0.01
## 4 Developing 59.5 272 69 0.01
## 5 Developing 59.2 275 71 0.01
## 6 Developing 58.8 279 74 0.01
## percentage.expenditure Hepatitis.B Measles BMI under.five.deaths Polio
## 1 71.279624 65 1154 19.1 83 6
## 2 73.523582 62 492 18.6 86 58
## 3 73.219243 64 430 18.1 89 62
## 4 78.184215 67 2787 17.6 93 67
## 5 7.097109 68 3013 17.2 97 68
## 6 79.679367 66 1989 16.7 102 66
## Total.expenditure Diphtheria HIV.AIDS GDP Population
## 1 8.16 65 0.1 584.25921 33736494
## 2 8.18 62 0.1 612.69651 327582
## 3 8.13 64 0.1 631.74498 31731688
## 4 8.52 67 0.1 669.95900 3696958
## 5 7.87 68 0.1 63.53723 2978599
## 6 9.20 66 0.1 553.32894 2883167
## thinness..1.19.years thinness.5.9.years Income.composition.of.resources
## 1 17.2 17.3 0.479
## 2 17.5 17.5 0.476
## 3 17.7 17.7 0.470
## 4 17.9 18.0 0.463
## 5 18.2 18.2 0.454
## 6 18.4 18.4 0.448
## Schooling
## 1 10.1
## 2 10.0
## 3 9.9
## 4 9.8
## 5 9.5
## 6 9.2
summary(dato)
## Status Life.expectancy Adult.Mortality infant.deaths
## Length:2298 Min. :36.30 Min. : 1.0 Min. : 0.00
## Class :character 1st Qu.:62.30 1st Qu.: 72.0 1st Qu.: 0.00
## Mode :character Median :71.40 Median :146.0 Median : 3.00
## Mean :68.69 Mean :170.1 Mean : 33.91
## 3rd Qu.:75.40 3rd Qu.:237.0 3rd Qu.: 23.00
## Max. :89.00 Max. :723.0 Max. :1800.00
## Alcohol percentage.expenditure Hepatitis.B Measles
## Min. : 0.010 Min. : 0.00 Min. : 2.00 Min. : 0.0
## 1st Qu.: 0.620 1st Qu.: 20.63 1st Qu.:63.00 1st Qu.: 0.0
## Median : 3.835 Median : 93.44 Median :86.00 Median : 17.0
## Mean : 4.553 Mean : 827.36 Mean :72.67 Mean : 2535.6
## 3rd Qu.: 7.440 3rd Qu.: 490.23 3rd Qu.:95.00 3rd Qu.: 433.2
## Max. :17.870 Max. :19479.91 Max. :99.00 Max. :212183.0
## BMI under.five.deaths Polio Total.expenditure
## Min. : 1.40 Min. : 0.00 Min. : 3.00 Min. : 0.370
## 1st Qu.:18.70 1st Qu.: 1.00 1st Qu.:75.00 1st Qu.: 4.343
## Median :41.40 Median : 4.00 Median :92.00 Median : 5.855
## Mean :37.34 Mean : 47.14 Mean :81.22 Mean : 5.997
## 3rd Qu.:55.70 3rd Qu.: 33.00 3rd Qu.:96.00 3rd Qu.: 7.690
## Max. :87.30 Max. :2500.00 Max. :99.00 Max. :17.240
## Diphtheria HIV.AIDS GDP Population
## Min. : 2.00 Min. : 0.100 Min. : 1.68 Min. :3.400e+01
## 1st Qu.:77.00 1st Qu.: 0.100 1st Qu.: 430.27 1st Qu.:1.966e+05
## Median :92.00 Median : 0.100 Median : 1483.93 Median :1.371e+06
## Mean :81.27 Mean : 2.046 Mean : 6548.41 Mean :1.269e+07
## 3rd Qu.:96.00 3rd Qu.: 1.100 3rd Qu.: 4972.20 3rd Qu.:7.388e+06
## Max. :99.00 Max. :50.600 Max. :119172.74 Max. :1.294e+09
## thinness..1.19.years thinness.5.9.years Income.composition.of.resources
## Min. : 0.100 Min. : 0.100 Min. :0.0000
## 1st Qu.: 1.500 1st Qu.: 1.500 1st Qu.:0.4790
## Median : 3.000 Median : 3.100 Median :0.6620
## Mean : 4.895 Mean : 4.963 Mean :0.6173
## 3rd Qu.: 7.400 3rd Qu.: 7.400 3rd Qu.:0.7660
## Max. :27.700 Max. :28.600 Max. :0.9480
## Schooling
## Min. : 0.00
## 1st Qu.: 9.80
## Median :12.10
## Mean :11.84
## 3rd Qu.:14.30
## Max. :20.70
As we can see, I now have 19 predictors, of which 18 are continuous and only 1 is categorical.
As said before, my purpose is to predict the life expectancy of a country. Let's first explore this variable.
describe(dato$Life.expectancy)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 2298 68.69 9.8 71.4 69.28 9.49 36.3 89 52.7 -0.53 -0.37 0.2
We can first see that across the whole dataset the minimum value of life expectancy is 36.30, while the maximum is 89. On average, the countries across the years have a life expectancy of 68.69, with a standard deviation of 9.8. The kurtosis is -0.37; this negative value means our variable has a “platykurtic” distribution, with a flatter peak and thinner tails compared to a normal distribution. In other words, there are fewer extreme values in the tails than a normal would produce.
Let's plot the target variable.
ggplot(dato, aes(x=Life.expectancy)) +
geom_histogram(aes(y=..density..), colour=color_2, fill=color_3)+
geom_density(alpha=0.3, fill=color_1)+
labs(x = "Life Expectancy", y = "Density")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Here we can see that our target variable has a distribution resembling a normal but, as said before, with fewer values in the tails; it is also somewhat left-skewed.
features <- setdiff(colnames(dato),'data')
numeric_features <- setdiff(colnames(dato[,2:20]),'dato')
Let's first plot a histogram of each one to see how its univariate distribution behaves:
for (f in numeric_features) {
hist(dato[,f], main=f)
}
As we have only one categorical variable (Status), let's see its relationship with our target variable.
ggplot(dato, aes(x=Status, y=Life.expectancy, fill=Status)) +
geom_boxplot()
As can be seen in the plot, developed countries have a higher median life expectancy, and their distribution is much more concentrated, meaning there is not much variability among developed countries; the median seems close to 80. Developing countries, on the other hand, have a less concentrated life expectancy distribution (the box is larger), with a median below 70. We can thus see a huge difference between developed and developing countries. This should not surprise us, as life expectancy is one of the variables to look at when assessing whether a country is developed.
As I will be using caret to model and predict my target variable, there is no need for standardization or scaling now, as the package can do it when running the algorithm.
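As a small illustration of what that standardization does (a base-R sketch on toy numbers; caret's preProcess with method = c("center", "scale") applies the same transformation per column):

```r
# Center and scale a toy column by hand
x <- c(1, 2, 3, 4)
scaled <- (x - mean(x)) / sd(x)
mean(scaled)  # 0 (up to floating point)
sd(scaled)    # 1
```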
Let's start to assess the relationships between the variables in the dataset.
First, let's check the correlations between the variables.
correlation <- corrplot(cor(dato[,2:20]),type="upper",method="circle",title="Correlation plot",mar=c(0.1,0.1,0.1,0.1))
As we can see, some variables have a strong correlation. For example, there is an almost perfect correlation between “under five deaths” and “infant deaths”. This is because they measure almost the same thing, just over slightly different age intervals. There is also a strong correlation between “GDP” and “percentage expenditure”. This is also logical: the richer the country, the bigger the share of GDP destined to the health sector.
Let's plot “infant deaths” vs. “under five deaths”:
ggplot(dato, aes(x=under.five.deaths, y=infant.deaths)) +
geom_point(size=2, shape=23)+ geom_abline(intercept = 0, slope = 1, colour = "blue")
Let's see what happens when we calculate the VIFs of a multiple regression on all the covariates (to check whether the two variables would produce multicollinearity). Here I will fit a multiple regression without splitting the data into training and testing sets, as I do not want to predict yet, only to assess multicollinearity.
multiple.lm <- lm(Life.expectancy ~., data=dato)
vif(multiple.lm)
## Status Adult.Mortality
## 1.995091 1.762713
## infant.deaths Alcohol
## 178.956741 2.151705
## percentage.expenditure Hepatitis.B
## 8.516394 1.477716
## Measles BMI
## 1.422992 1.858091
## under.five.deaths Polio
## 178.037256 1.932098
## Total.expenditure Diphtheria
## 1.174575 2.267897
## HIV.AIDS GDP
## 1.453274 9.241228
## Population thinness..1.19.years
## 1.494111 7.866809
## thinness.5.9.years Income.composition.of.resources
## 7.957444 3.690749
## Schooling
## 4.137874
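To make these numbers concrete: the VIF of a predictor j is 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the remaining predictors. A sketch on simulated data (not the project's dataset):

```r
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)  # x2 nearly duplicates x1
x3 <- rnorm(100)

# Regress x1 on the other predictors and convert its R^2 into a VIF
r2_x1  <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_x1 <- 1 / (1 - r2_x1)
vif_x1  # very large, flagging the near-duplication
```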
As we can see, the VIFs of “under five deaths” and “infant deaths” are extremely high, meaning that they probably measure almost the same thing. I will therefore drop “under five deaths”, as of the two it is the one with the lower correlation with the target variable; this way, we avoid the multicollinearity.
datos <- dato[,-10]
setdiff(colnames(datos),'data')
## [1] "Status" "Life.expectancy"
## [3] "Adult.Mortality" "infant.deaths"
## [5] "Alcohol" "percentage.expenditure"
## [7] "Hepatitis.B" "Measles"
## [9] "BMI" "Polio"
## [11] "Total.expenditure" "Diphtheria"
## [13] "HIV.AIDS" "GDP"
## [15] "Population" "thinness..1.19.years"
## [17] "thinness.5.9.years" "Income.composition.of.resources"
## [19] "Schooling"
X_matr <- data.matrix(datos)
comboInfo <- findLinearCombos(X_matr)
comboInfo
## $linearCombos
## list()
##
## $remove
## NULL
There are no columns that can be expressed as linear combinations of other columns. This is fundamental so that we have no problems later when inverting the variance-covariance matrix for regression.
First, let's split the data into training and testing sets with caret.
in_train <- createDataPartition((datos$Life.expectancy), p = 0.8, list = FALSE) # 80% for training
training <- datos[ in_train,]
testing <- datos[-in_train,]
nrow(training)
## [1] 1841
nrow(testing)
## [1] 457
control <- trainControl(method="cv", number=10)
Let's explore the training and testing sets to be sure there is no big bias in either of them.
ggplot(training, aes(x=Status, y=Life.expectancy, fill=Status)) +
geom_boxplot()
ggplot(testing, aes(x=Status, y=Life.expectancy, fill=Status)) +
geom_boxplot()
As far as we can tell from these boxplots, the training and testing sets look well balanced, both in the mix of developed and developing countries and in the life expectancy distributions.
Let's check which variables are most correlated with “Life expectancy”:
corr_life <- sort(cor(training[,2:19])["Life.expectancy",], decreasing = T)
corr=data.frame(corr_life)
ggplot(corr,aes(x = row.names(corr), y = corr_life)) +
geom_bar(stat = "identity", fill = "lightblue") +
scale_x_discrete(limits= row.names(corr)) +
labs(x = "Predictors", y = "Life Expectancy", title = "Correlations") +
theme(plot.title = element_text(hjust = 0, size = rel(1.5)),
axis.text.x = element_text(angle = 45, hjust = 1))
As we can see, the variables with the strongest positive correlations are Schooling and Income composition of resources, followed by BMI and GDP. On the other hand, some variables hold a strong negative correlation with our target, such as Adult Mortality (logically), HIV/AIDS (also logically) and other variables defined by illnesses that may cause death. The only variable with apparently no correlation with life expectancy is “Population”. Nevertheless, correlation does not necessarily imply causality, so we cannot yet tell which of these predictors best helps us predict life expectancy in a country.
For this project, I will use a benchmark in order to know how well or poorly I am doing when building my predictive models. In this case, I will use a multiple regression with the three variables most correlated with the target: “Schooling”, “Income composition of resources” and “Adult Mortality”.
benchFit <- lm(Life.expectancy ~ Schooling + Income.composition.of.resources + Adult.Mortality, data=training)
predictions <- predict(benchFit, newdata=testing)
cor((testing$Life.expectancy), predictions)^2
## [1] 0.7302342
MAE <- mean(abs(predictions - testing$Life.expectancy))
MAE
## [1] 3.49089
As we can see, our benchmark obtains an R-squared of 0.7302, meaning that it explains about 73% of the variability of the test set, and its mean absolute error is 3.49.
Let's first see what happens when we perform a multiple linear regression on all the covariates we have, and look at the importance of each variable.
fit.lm <- lm(Life.expectancy ~., data=training)
summary(fit.lm)
##
## Call:
## lm(formula = Life.expectancy ~ ., data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.5179 -2.3224 0.0532 2.3488 19.2881
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.241e+01 7.605e-01 68.913 < 2e-16 ***
## StatusDeveloping -1.365e+00 3.471e-01 -3.932 8.74e-05 ***
## Adult.Mortality -1.522e-02 9.786e-04 -15.550 < 2e-16 ***
## infant.deaths 2.922e-04 1.113e-03 0.262 0.793007
## Alcohol -1.688e-01 3.332e-02 -5.065 4.50e-07 ***
## percentage.expenditure 1.045e-04 1.208e-04 0.865 0.387160
## Hepatitis.B -6.518e-03 3.817e-03 -1.707 0.087906 .
## Measles -1.618e-05 1.045e-05 -1.548 0.121902
## BMI 4.931e-02 6.441e-03 7.656 3.10e-14 ***
## Polio 2.069e-02 5.544e-03 3.731 0.000196 ***
## Total.expenditure 6.224e-02 4.227e-02 1.473 0.141053
## Diphtheria 3.066e-02 6.037e-03 5.079 4.18e-07 ***
## HIV.AIDS -4.840e-01 2.030e-02 -23.844 < 2e-16 ***
## GDP 2.636e-05 2.025e-05 1.302 0.193091
## Population 2.899e-10 1.846e-09 0.157 0.875229
## thinness..1.19.years -9.903e-02 5.777e-02 -1.714 0.086666 .
## thinness.5.9.years 2.758e-02 5.644e-02 0.489 0.625167
## Income.composition.of.resources 1.009e+01 8.516e-01 11.845 < 2e-16 ***
## Schooling 8.173e-01 5.524e-02 14.797 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.042 on 1822 degrees of freedom
## Multiple R-squared: 0.833, Adjusted R-squared: 0.8314
## F-statistic: 505 on 18 and 1822 DF, p-value: < 2.2e-16
lm_imp <- varImp(fit.lm, scale = F)
lm_imp
## Overall
## StatusDeveloping 3.9320310
## Adult.Mortality 15.5498067
## infant.deaths 0.2624459
## Alcohol 5.0650736
## percentage.expenditure 0.8649873
## Hepatitis.B 1.7074671
## Measles 1.5475578
## BMI 7.6555153
## Polio 3.7312879
## Total.expenditure 1.4725201
## Diphtheria 5.0791626
## HIV.AIDS 23.8437662
## GDP 1.3019700
## Population 0.1570423
## thinness..1.19.years 1.7141778
## thinness.5.9.years 0.4886230
## Income.composition.of.resources 11.8450486
## Schooling 14.7967152
According to this simple approach, the most important variable would be “HIV/AIDS”, followed by “Adult Mortality”, “Schooling” and “Income composition of resources”. This is interesting, as these were the variables that showed a strong correlation with the target variable. Here we are looking at the largest absolute t-values in the regression. Let's see what happens when we look at the lowest p-values instead:
summary(fit.lm)
##
## Call:
## lm(formula = Life.expectancy ~ ., data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.5179 -2.3224 0.0532 2.3488 19.2881
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.241e+01 7.605e-01 68.913 < 2e-16 ***
## StatusDeveloping -1.365e+00 3.471e-01 -3.932 8.74e-05 ***
## Adult.Mortality -1.522e-02 9.786e-04 -15.550 < 2e-16 ***
## infant.deaths 2.922e-04 1.113e-03 0.262 0.793007
## Alcohol -1.688e-01 3.332e-02 -5.065 4.50e-07 ***
## percentage.expenditure 1.045e-04 1.208e-04 0.865 0.387160
## Hepatitis.B -6.518e-03 3.817e-03 -1.707 0.087906 .
## Measles -1.618e-05 1.045e-05 -1.548 0.121902
## BMI 4.931e-02 6.441e-03 7.656 3.10e-14 ***
## Polio 2.069e-02 5.544e-03 3.731 0.000196 ***
## Total.expenditure 6.224e-02 4.227e-02 1.473 0.141053
## Diphtheria 3.066e-02 6.037e-03 5.079 4.18e-07 ***
## HIV.AIDS -4.840e-01 2.030e-02 -23.844 < 2e-16 ***
## GDP 2.636e-05 2.025e-05 1.302 0.193091
## Population 2.899e-10 1.846e-09 0.157 0.875229
## thinness..1.19.years -9.903e-02 5.777e-02 -1.714 0.086666 .
## thinness.5.9.years 2.758e-02 5.644e-02 0.489 0.625167
## Income.composition.of.resources 1.009e+01 8.516e-01 11.845 < 2e-16 ***
## Schooling 8.173e-01 5.524e-02 14.797 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.042 on 1822 degrees of freedom
## Multiple R-squared: 0.833, Adjusted R-squared: 0.8314
## F-statistic: 505 on 18 and 1822 DF, p-value: < 2.2e-16
As we can see, through this approach the most important variables would be "Development Status", "Adult Mortality", "Alcohol", "BMI", "Polio", "Diphtheria", "HIV", "Income Composition of Resources" and "Schooling". Among these, the lowest p-values belong to "Adult Mortality", "HIV", "Income Composition of Resources" and "Schooling". The results are similar to the ones obtained before, although there are some differences. Note that the adjusted R-squared is 0.8314, so we are explaining a large part of the variability in the dataset.
Let's check the diagnostic plots to assess whether the assumptions of the model are violated.
plot(fit.lm)
As we can see in the plots, the residuals are not very well centered around zero, and there is clearly some heteroscedasticity (non-constant variance), so the linearity assumption is somewhat violated here. For the most part the residuals behave roughly as Gaussian, and there seem to be some possible outliers: although no points fall outside the red line in the corners of the fourth plot, a few come close to it.
prLM <- predict(fit.lm, newdata=testing)
cor((testing$Life.expectancy), prLM)^2
## [1] 0.8364015
mean(abs((testing$Life.expectancy)- prLM))
## [1] 2.968201
As we can see, when taking into account all the covariates we get a better result than our benchmark: the R-squared rises to 0.8364 and the mean absolute error decreases from 3.2528 to 2.97. This is a good start. From now on, I will also pay attention to the adjusted R-squared, as I want to penalize for the number of variables used.
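As a reminder of why the adjusted R-squared is the right metric for this, it penalizes the ordinary R-squared by the number of predictors p relative to the sample size n. A minimal sketch of the computation, plugged with the figures from the summary above (18 predictors, 1822 residual degrees of freedom, hence n = 1841):

```r
# Adjusted R-squared: penalize R^2 by the number of predictors p
# relative to the sample size n
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}
adj_r2(0.833, n = 1841, p = 18)  # ~0.8314, matching the summary above
```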
I will also run this regression through the caret package, to allow a cleaner comparison between all the predictive models later on.
lm1 <- train(Life.expectancy ~ .,data=training, method='lm',
preProcess=c("scale","center"), trControl=control)
test_results <- data.frame(Life.expectancy = (testing$Life.expectancy))
test_results$lm1 <- predict(lm1, testing)
postResample(pred = test_results$lm1, obs = test_results$Life.expectancy)
## RMSE Rsquared MAE
## 3.9043953 0.8364015 2.9682015
With caret we obtain the same predictive performance, but it also lets us apply automatic transformations to the data, here scaling and centering. Let's see a plot of the observed vs. the predicted values of life expectancy in the testing set:
qplot(test_results$lm1, test_results$Life.expectancy) +
labs(title="Linear Regression Observed VS Predicted", x="Predicted", y="Observed") +
geom_abline(intercept = 0, slope = 1, colour = "blue") +
theme_bw()
Here we can see that there is some noise in the predictions. Even though the results follow a linear tendency, there are some large residuals, especially at the corners.
Let's see what happens when we again fit a regression with all the covariates, but now using a generalized linear model that accounts for the skewness in our target variable: we will treat it as Gamma-distributed. This will give us a hint on how to deal with our target variable.
lm2 <- train(Life.expectancy ~ .,data=training, method='glm',family="Gamma",
preProcess=c("scale","center"), trControl=control)
test_results <- data.frame(Life.expectancy = (testing$Life.expectancy))
test_results$lm2 <- predict(lm2, testing)
postResample(pred = test_results$lm2, obs = test_results$Life.expectancy)
## RMSE Rsquared MAE
## 3.8984987 0.8369797 2.9420461
When using a Gamma family for the GLM, we get an even better result than when assuming a Gaussian distribution for our target variable: an R-squared of 0.8370 (higher than the 0.8364 obtained with the linear model before), a mean absolute error of 2.942 (lower than the 2.968 from before) and also a lower RMSE. So far, this is our best model.
Now, I will perform two different robust regressions with all the covariates. Robust regression is a variant of linear regression in which each observation receives a weight assigned by a specific weighting function. I will use the Huber and the bisquare functions to account for possible outliers:
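As a rough sketch of the idea (not the exact rlm internals, which apply these weights iteratively), both functions downweight observations with large scaled residuals u; the tuning constants below are the usual defaults:

```r
# Weight assigned to an observation with scaled residual u (illustrative sketch):
# Huber leaves small residuals untouched and downweights large ones smoothly;
# bisquare gives zero weight to residuals beyond the cutoff c
huber_w    <- function(u, k = 1.345) ifelse(abs(u) <= k, 1, k / abs(u))
bisquare_w <- function(u, c = 4.685) ifelse(abs(u) <= c, (1 - (u / c)^2)^2, 0)
huber_w(3)     # < 1: a large residual is downweighted but still counted
bisquare_w(6)  # 0: an extreme residual is ignored entirely
```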
linfit.huber <- rlm(Life.expectancy ~ . , data=training)
summary(linfit.huber)
##
## Call: rlm(formula = Life.expectancy ~ ., data = training)
## Residuals:
## Min 1Q Median 3Q Max
## -20.71981 -2.25319 0.02124 2.25871 20.34708
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 52.2654 0.6743 77.5125
## StatusDeveloping -0.9857 0.3078 -3.2028
## Adult.Mortality -0.0167 0.0009 -19.2293
## infant.deaths 0.0006 0.0010 0.5610
## Alcohol -0.1559 0.0295 -5.2773
## percentage.expenditure 0.0001 0.0001 1.1980
## Hepatitis.B -0.0020 0.0034 -0.5771
## Measles 0.0000 0.0000 -1.2462
## BMI 0.0338 0.0057 5.9199
## Polio 0.0165 0.0049 3.3624
## Total.expenditure 0.0755 0.0375 2.0154
## Diphtheria 0.0280 0.0054 5.2309
## HIV.AIDS -0.4744 0.0180 -26.3588
## GDP 0.0000 0.0000 0.8903
## Population 0.0000 0.0000 -0.2124
## thinness..1.19.years -0.0880 0.0512 -1.7175
## thinness.5.9.years 0.0112 0.0500 0.2231
## Income.composition.of.resources 12.5198 0.7550 16.5831
## Schooling 0.7576 0.0490 15.4696
##
## Residual standard error: 3.342 on 1822 degrees of freedom
linfit.bisquare <- rlm(Life.expectancy ~ . , data=training, psi = psi.bisquare)
## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps
summary(linfit.bisquare)
##
## Call: rlm(formula = Life.expectancy ~ ., data = training, psi = psi.bisquare)
## Residuals:
## Min 1Q Median 3Q Max
## -20.43139 -2.21021 0.01212 2.23254 21.25097
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 51.5547 0.6834 75.4417
## StatusDeveloping -0.9119 0.3119 -2.9235
## Adult.Mortality -0.0158 0.0009 -17.9907
## infant.deaths 0.0005 0.0010 0.5408
## Alcohol -0.1567 0.0299 -5.2347
## percentage.expenditure 0.0001 0.0001 1.2845
## Hepatitis.B -0.0003 0.0034 -0.0939
## Measles 0.0000 0.0000 -0.8400
## BMI 0.0285 0.0058 4.9282
## Polio 0.0142 0.0050 2.8481
## Total.expenditure 0.0868 0.0380 2.2847
## Diphtheria 0.0272 0.0054 5.0110
## HIV.AIDS -0.4878 0.0182 -26.7448
## GDP 0.0000 0.0000 0.7100
## Population 0.0000 0.0000 -0.3310
## thinness..1.19.years -0.0738 0.0519 -1.4225
## thinness.5.9.years 0.0026 0.0507 0.0520
## Income.composition.of.resources 14.2728 0.7652 18.6535
## Schooling 0.7287 0.0496 14.6830
##
## Residual standard error: 3.291 on 1822 degrees of freedom
Let's see how they predict.
prHuber <- predict(linfit.huber, newdata=testing)
cor((testing$Life.expectancy), prHuber)^2
## [1] 0.8309922
mean(abs((testing$Life.expectancy)- prHuber))
## [1] 2.945815
prBi <- predict(linfit.bisquare, newdata=testing)
cor((testing$Life.expectancy), prBi)^2
## [1] 0.825321
mean(abs((testing$Life.expectancy)- prBi))
## [1] 2.96449
Here we get an interesting result. The R-squared values are lower than those of our previous models, but both robust fits achieve a lower mean absolute error than the plain linear model on the testing set. Between the two, the Huber version is apparently the better predictor, although their results are very similar.
Let's now try to obtain a more parsimonious model, while also assessing which model best serves our goal.
First, let's run forward stepwise selection on the ordinary least squares regression, based on p-values:
forwardpv <- ols_step_forward_p(fit.lm) # forward based on p-value
plot(forwardpv)
forwardpv$model
##
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")),
## data = l)
##
## Coefficients:
## (Intercept) Income.composition.of.resources
## 5.244e+01 1.011e+01
## Schooling HIV.AIDS
## 8.144e-01 -4.844e-01
## Adult.Mortality BMI
## -1.522e-02 4.903e-02
## Diphtheria GDP
## 3.089e-02 4.202e-05
## Polio Alcohol
## 2.059e-02 -1.679e-01
## StatusDeveloping thinness..1.19.years
## -1.368e+00 -6.875e-02
## Hepatitis.B Measles
## -7.057e-03 -1.422e-05
## Total.expenditure
## 6.345e-02
As we can see, 14 variables are retained here. Let's see how it predicts:
leap1 <- predict(forwardpv$model, newdata=testing)
cor((testing$Life.expectancy), leap1)^2
## [1] 0.8365429
mean(abs((testing$Life.expectancy)- leap1))
## [1] 2.969078
The results are not better than the ones we already have, although they are close and achieved with fewer variables (the R-squared of 0.8365 edges out the full linear model's, but the Gamma GLM still has both a higher R-squared and a lower MAE).
Let's now try using the Akaike information criterion (AIC).
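For a Gaussian linear model, the quantity that AIC-guided stepwise selection minimizes can be sketched (up to an additive constant, and without claiming to reproduce the olsrr internals) as:

```r
# AIC = n*log(RSS/n) + 2*k up to an additive constant, where k counts the
# estimated parameters; a candidate term is kept only if the drop in RSS
# is large enough to offset the +2 penalty per extra parameter
aic_gauss <- function(rss, n, k) n * log(rss / n) + 2 * k
```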
First, let's do forward selection, guided by the AIC:
forwaic <- ols_step_forward_aic(fit.lm) # forward based on AIC
forwaic$model
##
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")),
## data = l)
##
## Coefficients:
## (Intercept) Schooling
## 5.244e+01 8.144e-01
## HIV.AIDS Adult.Mortality
## -4.844e-01 -1.522e-02
## Income.composition.of.resources BMI
## 1.011e+01 4.903e-02
## Diphtheria GDP
## 3.089e-02 4.202e-05
## Polio Alcohol
## 2.059e-02 -1.679e-01
## StatusDeveloping thinness..1.19.years
## -1.368e+00 -6.875e-02
## Hepatitis.B Measles
## -7.057e-03 -1.422e-05
## Total.expenditure
## 6.345e-02
forwpred <- predict(forwaic$model, newdata=testing)
cor((testing$Life.expectancy), forwpred)^2
## [1] 0.8365429
mean(abs((testing$Life.expectancy)- forwpred))
## [1] 2.969078
Now, let's do the same with backward elimination, again guided by the AIC:
backaic <- ols_step_backward_aic(fit.lm) # backward AIC
backaic$model
##
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")),
## data = l)
##
## Coefficients:
## (Intercept) StatusDeveloping
## 5.244e+01 -1.368e+00
## Adult.Mortality Alcohol
## -1.522e-02 -1.679e-01
## Hepatitis.B Measles
## -7.057e-03 -1.422e-05
## BMI Polio
## 4.903e-02 2.059e-02
## Total.expenditure Diphtheria
## 6.345e-02 3.089e-02
## HIV.AIDS GDP
## -4.844e-01 4.202e-05
## thinness..1.19.years Income.composition.of.resources
## -6.875e-02 1.011e+01
## Schooling
## 8.144e-01
backpred <- predict(backaic$model, newdata=testing)
cor((testing$Life.expectancy), backpred)^2
## [1] 0.8365429
mean(abs((testing$Life.expectancy)- backpred))
## [1] 2.969078
As the backward and forward selection processes yield the same result, there is no point in running the stepwise AIC selection in both directions. In both cases I reach a more parsimonious model than the full-covariate one, but we do not beat the previous models: we get an R-squared of 0.8365 and a mean absolute error of 2.97.
Nevertheless, I am interested in what happens when the same is done while accounting for interactions. It is highly likely we will end up with a huge model with many terms, as I will try every possible two-way interaction (again forward, backward and in both directions, always guided by the AIC).
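In R's formula notation, the scope `. ~ .^2` expands to all main effects plus every two-way interaction between them, which is the search space handed to `step()`. A toy illustration (x1, x2, x3 are made-up variable names):

```r
# .^2 expands a formula into all main effects plus all two-way interactions;
# step() is then free to add any of these terms
f <- ~ (x1 + x2 + x3)^2
attr(terms(f), "term.labels")
# "x1" "x2" "x3" "x1:x2" "x1:x3" "x2:x3"
```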
First, forward:
forwint <- step(fit.lm, scope = . ~ .^2, direction = 'forward')
summary(forwint)
##
## Call:
## lm(formula = Life.expectancy ~ Status + Adult.Mortality + infant.deaths +
## Alcohol + percentage.expenditure + Hepatitis.B + Measles +
## BMI + Polio + Total.expenditure + Diphtheria + HIV.AIDS +
## GDP + Population + thinness..1.19.years + thinness.5.9.years +
## Income.composition.of.resources + Schooling + Adult.Mortality:HIV.AIDS +
## BMI:Diphtheria + thinness..1.19.years:thinness.5.9.years +
## percentage.expenditure:Income.composition.of.resources +
## Income.composition.of.resources:Schooling + BMI:Schooling +
## Adult.Mortality:Schooling + HIV.AIDS:Income.composition.of.resources +
## Total.expenditure:thinness..1.19.years + Alcohol:thinness.5.9.years +
## Status:Schooling + Status:Income.composition.of.resources +
## Status:Adult.Mortality + Measles:HIV.AIDS + Alcohol:HIV.AIDS +
## Alcohol:GDP + Adult.Mortality:Diphtheria + Adult.Mortality:Alcohol +
## Polio:Schooling + Diphtheria:Income.composition.of.resources +
## Status:BMI + Hepatitis.B:Total.expenditure + Total.expenditure:HIV.AIDS +
## Adult.Mortality:Total.expenditure + percentage.expenditure:HIV.AIDS +
## Alcohol:Total.expenditure + percentage.expenditure:thinness.5.9.years +
## Status:Alcohol + Status:thinness..1.19.years + Alcohol:Schooling +
## Status:GDP + percentage.expenditure:GDP + Diphtheria:Schooling +
## HIV.AIDS:Schooling + HIV.AIDS:thinness..1.19.years + Measles:BMI +
## Status:infant.deaths + Measles:Polio + infant.deaths:Total.expenditure +
## Total.expenditure:Population + Alcohol:Population + Total.expenditure:Diphtheria +
## Alcohol:Polio + Hepatitis.B:thinness..1.19.years + Alcohol:Income.composition.of.resources +
## Diphtheria:thinness.5.9.years + Hepatitis.B:thinness.5.9.years +
## BMI:Polio + infant.deaths:Hepatitis.B + infant.deaths:Population +
## Total.expenditure:Income.composition.of.resources + BMI:GDP +
## Status:Total.expenditure + Hepatitis.B:Diphtheria + Measles:Diphtheria +
## infant.deaths:Diphtheria + infant.deaths:Polio + infant.deaths:Income.composition.of.resources +
## Population:Income.composition.of.resources + percentage.expenditure:Population +
## Hepatitis.B:Population + Polio:GDP + Hepatitis.B:Polio +
## Total.expenditure:GDP + Adult.Mortality:percentage.expenditure +
## Adult.Mortality:GDP + Status:Diphtheria + Population:thinness.5.9.years,
## data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.2852 -1.6312 -0.0146 1.6706 11.1827
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 2.402e+01 6.334e+00
## StatusDeveloping 1.755e+01 5.873e+00
## Adult.Mortality 3.489e-02 6.734e-03
## infant.deaths 4.168e-01 1.897e-01
## Alcohol -1.139e-01 2.535e-01
## percentage.expenditure 3.731e-03 1.333e-03
## Hepatitis.B -3.852e-02 1.361e-02
## Measles -2.416e-06 4.179e-05
## BMI 3.898e-01 3.657e-02
## Polio 7.262e-02 1.843e-02
## Total.expenditure 8.310e-02 2.225e-01
## Diphtheria 1.328e-01 3.450e-02
## HIV.AIDS -1.308e+00 1.475e-01
## GDP 1.710e-04 8.230e-05
## Population 3.378e-08 2.112e-08
## thinness..1.19.years -1.610e+00 3.927e-01
## thinness.5.9.years 3.266e-01 1.381e-01
## Income.composition.of.resources 3.351e+01 7.853e+00
## Schooling 3.414e-01 3.061e-01
## Adult.Mortality:HIV.AIDS 7.948e-04 7.060e-05
## BMI:Diphtheria -1.797e-03 3.310e-04
## thinness..1.19.years:thinness.5.9.years 1.900e-02 3.369e-03
## percentage.expenditure:Income.composition.of.resources -5.046e-03 1.450e-03
## Income.composition.of.resources:Schooling 8.378e-01 1.611e-01
## BMI:Schooling -1.681e-02 1.894e-03
## Adult.Mortality:Schooling -1.022e-03 3.551e-04
## HIV.AIDS:Income.composition.of.resources 7.113e-01 3.900e-01
## Total.expenditure:thinness..1.19.years -5.840e-02 1.057e-02
## Alcohol:thinness.5.9.years -4.546e-02 1.082e-02
## StatusDeveloping:Schooling 1.362e+00 2.243e-01
## StatusDeveloping:Income.composition.of.resources -4.361e+01 6.995e+00
## StatusDeveloping:Adult.Mortality -2.643e-02 5.222e-03
## Measles:HIV.AIDS -9.086e-06 2.848e-06
## Alcohol:HIV.AIDS 6.003e-02 1.036e-02
## Alcohol:GDP -6.977e-06 2.758e-06
## Adult.Mortality:Diphtheria -8.480e-05 3.006e-05
## Adult.Mortality:Alcohol -1.007e-03 2.841e-04
## Polio:Schooling -6.452e-03 1.880e-03
## Diphtheria:Income.composition.of.resources 1.096e-01 2.366e-02
## StatusDeveloping:BMI -4.839e-02 1.587e-02
## Hepatitis.B:Total.expenditure 4.738e-03 1.229e-03
## Total.expenditure:HIV.AIDS 6.310e-02 1.128e-02
## Adult.Mortality:Total.expenditure -1.044e-03 3.429e-04
## percentage.expenditure:HIV.AIDS -2.751e-04 8.857e-05
## Alcohol:Total.expenditure -4.295e-02 1.159e-02
## percentage.expenditure:thinness.5.9.years 2.773e-04 6.977e-05
## StatusDeveloping:Alcohol 2.027e-01 8.642e-02
## StatusDeveloping:thinness..1.19.years 1.318e+00 3.736e-01
## Alcohol:Schooling 5.856e-02 1.878e-02
## StatusDeveloping:GDP 7.194e-05 2.277e-05
## percentage.expenditure:GDP 6.251e-09 1.782e-09
## Diphtheria:Schooling -3.624e-03 2.131e-03
## HIV.AIDS:Schooling -4.388e-02 1.711e-02
## HIV.AIDS:thinness..1.19.years 7.464e-03 4.344e-03
## Measles:BMI -3.290e-06 1.193e-06
## StatusDeveloping:infant.deaths -4.150e-01 1.897e-01
## Measles:Polio 4.001e-06 1.193e-06
## infant.deaths:Total.expenditure -3.527e-03 1.217e-03
## Total.expenditure:Population 6.338e-09 1.675e-09
## Alcohol:Population -2.558e-09 1.007e-09
## Total.expenditure:Diphtheria -4.879e-03 1.830e-03
## Alcohol:Polio 3.063e-03 1.341e-03
## Hepatitis.B:thinness..1.19.years 4.992e-03 1.725e-03
## Alcohol:Income.composition.of.resources -8.690e-01 3.341e-01
## Diphtheria:thinness.5.9.years -2.659e-03 1.149e-03
## Hepatitis.B:thinness.5.9.years -2.453e-03 1.659e-03
## BMI:Polio 4.762e-04 2.700e-04
## infant.deaths:Hepatitis.B -8.572e-05 3.370e-05
## infant.deaths:Population 8.887e-12 1.179e-11
## Total.expenditure:Income.composition.of.resources 4.240e-01 2.235e-01
## BMI:GDP 6.358e-07 4.127e-07
## StatusDeveloping:Total.expenditure 2.105e-01 1.134e-01
## Hepatitis.B:Diphtheria 2.964e-04 1.376e-04
## Measles:Diphtheria -2.473e-06 1.092e-06
## infant.deaths:Diphtheria 1.204e-04 5.260e-05
## infant.deaths:Polio -2.304e-04 7.363e-05
## infant.deaths:Income.composition.of.resources 3.547e-02 1.214e-02
## Population:Income.composition.of.resources -1.051e-07 3.605e-08
## percentage.expenditure:Population 6.742e-12 2.948e-12
## Hepatitis.B:Population 2.069e-10 9.662e-11
## Polio:GDP -1.807e-06 7.306e-07
## Hepatitis.B:Polio -2.432e-04 1.449e-04
## Total.expenditure:GDP 4.349e-06 2.568e-06
## Adult.Mortality:percentage.expenditure 6.031e-06 2.337e-06
## Adult.Mortality:GDP -6.573e-07 3.163e-07
## StatusDeveloping:Diphtheria -3.646e-02 2.235e-02
## Population:thinness.5.9.years -6.658e-10 4.591e-10
## t value Pr(>|t|)
## (Intercept) 3.792 0.000154 ***
## StatusDeveloping 2.988 0.002849 **
## Adult.Mortality 5.182 2.45e-07 ***
## infant.deaths 2.197 0.028125 *
## Alcohol -0.449 0.653196
## percentage.expenditure 2.799 0.005184 **
## Hepatitis.B -2.830 0.004713 **
## Measles -0.058 0.953909
## BMI 10.659 < 2e-16 ***
## Polio 3.941 8.42e-05 ***
## Total.expenditure 0.373 0.708853
## Diphtheria 3.848 0.000124 ***
## HIV.AIDS -8.868 < 2e-16 ***
## GDP 2.077 0.037911 *
## Population 1.599 0.109910
## thinness..1.19.years -4.099 4.34e-05 ***
## thinness.5.9.years 2.364 0.018174 *
## Income.composition.of.resources 4.267 2.09e-05 ***
## Schooling 1.115 0.264855
## Adult.Mortality:HIV.AIDS 11.258 < 2e-16 ***
## BMI:Diphtheria -5.428 6.50e-08 ***
## thinness..1.19.years:thinness.5.9.years 5.640 1.98e-08 ***
## percentage.expenditure:Income.composition.of.resources -3.481 0.000513 ***
## Income.composition.of.resources:Schooling 5.200 2.23e-07 ***
## BMI:Schooling -8.874 < 2e-16 ***
## Adult.Mortality:Schooling -2.879 0.004042 **
## HIV.AIDS:Income.composition.of.resources 1.824 0.068350 .
## Total.expenditure:thinness..1.19.years -5.527 3.74e-08 ***
## Alcohol:thinness.5.9.years -4.202 2.78e-05 ***
## StatusDeveloping:Schooling 6.075 1.52e-09 ***
## StatusDeveloping:Income.composition.of.resources -6.234 5.67e-10 ***
## StatusDeveloping:Adult.Mortality -5.062 4.58e-07 ***
## Measles:HIV.AIDS -3.191 0.001443 **
## Alcohol:HIV.AIDS 5.793 8.17e-09 ***
## Alcohol:GDP -2.530 0.011501 *
## Adult.Mortality:Diphtheria -2.821 0.004844 **
## Adult.Mortality:Alcohol -3.546 0.000402 ***
## Polio:Schooling -3.431 0.000614 ***
## Diphtheria:Income.composition.of.resources 4.630 3.92e-06 ***
## StatusDeveloping:BMI -3.049 0.002328 **
## Hepatitis.B:Total.expenditure 3.854 0.000120 ***
## Total.expenditure:HIV.AIDS 5.594 2.57e-08 ***
## Adult.Mortality:Total.expenditure -3.044 0.002371 **
## percentage.expenditure:HIV.AIDS -3.106 0.001928 **
## Alcohol:Total.expenditure -3.705 0.000218 ***
## percentage.expenditure:thinness.5.9.years 3.974 7.35e-05 ***
## StatusDeveloping:Alcohol 2.345 0.019129 *
## StatusDeveloping:thinness..1.19.years 3.529 0.000427 ***
## Alcohol:Schooling 3.119 0.001847 **
## StatusDeveloping:GDP 3.160 0.001607 **
## percentage.expenditure:GDP 3.507 0.000464 ***
## Diphtheria:Schooling -1.701 0.089148 .
## HIV.AIDS:Schooling -2.564 0.010426 *
## HIV.AIDS:thinness..1.19.years 1.718 0.085895 .
## Measles:BMI -2.758 0.005884 **
## StatusDeveloping:infant.deaths -2.188 0.028814 *
## Measles:Polio 3.354 0.000812 ***
## infant.deaths:Total.expenditure -2.897 0.003808 **
## Total.expenditure:Population 3.785 0.000159 ***
## Alcohol:Population -2.540 0.011178 *
## Total.expenditure:Diphtheria -2.666 0.007742 **
## Alcohol:Polio 2.284 0.022478 *
## Hepatitis.B:thinness..1.19.years 2.894 0.003853 **
## Alcohol:Income.composition.of.resources -2.601 0.009381 **
## Diphtheria:thinness.5.9.years -2.315 0.020718 *
## Hepatitis.B:thinness.5.9.years -1.479 0.139424
## BMI:Polio 1.764 0.077966 .
## infant.deaths:Hepatitis.B -2.544 0.011052 *
## infant.deaths:Population 0.754 0.451133
## Total.expenditure:Income.composition.of.resources 1.897 0.057982 .
## BMI:GDP 1.540 0.123644
## StatusDeveloping:Total.expenditure 1.857 0.063445 .
## Hepatitis.B:Diphtheria 2.153 0.031420 *
## Measles:Diphtheria -2.266 0.023594 *
## infant.deaths:Diphtheria 2.289 0.022201 *
## infant.deaths:Polio -3.129 0.001785 **
## infant.deaths:Income.composition.of.resources 2.921 0.003537 **
## Population:Income.composition.of.resources -2.914 0.003611 **
## percentage.expenditure:Population 2.287 0.022291 *
## Hepatitis.B:Population 2.142 0.032361 *
## Polio:GDP -2.473 0.013500 *
## Hepatitis.B:Polio -1.679 0.093402 .
## Total.expenditure:GDP 1.693 0.090558 .
## Adult.Mortality:percentage.expenditure 2.581 0.009935 **
## Adult.Mortality:GDP -2.078 0.037856 *
## StatusDeveloping:Diphtheria -1.631 0.103096
## Population:thinness.5.9.years -1.450 0.147176
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.14 on 1754 degrees of freedom
## Multiple R-squared: 0.903, Adjusted R-squared: 0.8982
## F-statistic: 189.8 on 86 and 1754 DF, p-value: < 2.2e-16
As suspected, we have a huge (and practically uninterpretable) model, given all possible interactions. Nevertheless, the R-squared is 0.903, which means we are accounting for more than 90% of the total variability in the dataset with this model. Let's see how it predicts.
test_results$forwint <- predict(forwint, testing)
postResample(pred = test_results$forwint, obs = test_results$Life.expectancy)
## RMSE Rsquared MAE
## 3.4617900 0.8734371 2.6182675
qplot(test_results$forwint, test_results$Life.expectancy) +
labs(title="Linear Regression Observed VS Predicted", x="Predicted", y="Observed") +
geom_abline(intercept = 0, slope = 1, colour = "blue") +
theme_bw()
bothint <- step(fit.lm, scope = . ~ .^2, direction = 'both')
summary(bothint)
##
## Call:
## lm(formula = Life.expectancy ~ Status + Adult.Mortality + infant.deaths +
## Alcohol + percentage.expenditure + Hepatitis.B + Measles +
## BMI + Polio + Total.expenditure + Diphtheria + HIV.AIDS +
## GDP + Population + thinness..1.19.years + thinness.5.9.years +
## Income.composition.of.resources + Schooling + Adult.Mortality:HIV.AIDS +
## BMI:Diphtheria + thinness..1.19.years:thinness.5.9.years +
## percentage.expenditure:Income.composition.of.resources +
## Income.composition.of.resources:Schooling + BMI:Schooling +
## Adult.Mortality:Schooling + HIV.AIDS:Income.composition.of.resources +
## Total.expenditure:thinness..1.19.years + Alcohol:thinness.5.9.years +
## Status:Schooling + Status:Income.composition.of.resources +
## Status:Adult.Mortality + Measles:HIV.AIDS + Alcohol:HIV.AIDS +
## Alcohol:GDP + Adult.Mortality:Diphtheria + Adult.Mortality:Alcohol +
## Polio:Schooling + Diphtheria:Income.composition.of.resources +
## Status:BMI + Hepatitis.B:Total.expenditure + Total.expenditure:HIV.AIDS +
## Adult.Mortality:Total.expenditure + percentage.expenditure:HIV.AIDS +
## Alcohol:Total.expenditure + percentage.expenditure:thinness.5.9.years +
## Status:Alcohol + Status:thinness..1.19.years + Alcohol:Schooling +
## Status:GDP + percentage.expenditure:GDP + HIV.AIDS:Schooling +
## HIV.AIDS:thinness..1.19.years + Measles:BMI + Status:infant.deaths +
## Measles:Polio + infant.deaths:Total.expenditure + Total.expenditure:Population +
## Alcohol:Population + Total.expenditure:Diphtheria + Alcohol:Polio +
## Hepatitis.B:thinness..1.19.years + Alcohol:Income.composition.of.resources +
## Diphtheria:thinness.5.9.years + BMI:Polio + infant.deaths:Hepatitis.B +
## Total.expenditure:Income.composition.of.resources + Status:Total.expenditure +
## Hepatitis.B:Diphtheria + Measles:Diphtheria + infant.deaths:Diphtheria +
## infant.deaths:Polio + infant.deaths:Income.composition.of.resources +
## Population:Income.composition.of.resources + percentage.expenditure:Population +
## Hepatitis.B:Population + Polio:GDP + Hepatitis.B:Polio +
## Total.expenditure:GDP + Adult.Mortality:percentage.expenditure +
## Adult.Mortality:GDP + Status:Polio + Alcohol:BMI + Hepatitis.B:thinness.5.9.years +
## Population:thinness.5.9.years + Population:thinness..1.19.years +
## infant.deaths:thinness..1.19.years, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.5270 -1.6030 -0.0245 1.6431 11.0626
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 2.306e+01 6.401e+00
## StatusDeveloping 1.848e+01 5.939e+00
## Adult.Mortality 3.503e-02 6.753e-03
## infant.deaths 3.775e-01 1.911e-01
## Alcohol -1.421e-01 2.552e-01
## percentage.expenditure 3.512e-03 1.317e-03
## Hepatitis.B -3.905e-02 1.350e-02
## Measles -2.305e-06 4.155e-05
## BMI 3.975e-01 3.625e-02
## Polio 1.290e-01 3.284e-02
## Total.expenditure 7.095e-02 2.224e-01
## Diphtheria 7.352e-02 2.199e-02
## HIV.AIDS -1.297e+00 1.475e-01
## GDP 2.036e-04 8.045e-05
## Population 2.271e-08 2.168e-08
## thinness..1.19.years -1.637e+00 3.936e-01
## thinness.5.9.years 3.395e-01 1.377e-01
## Income.composition.of.resources 3.462e+01 7.748e+00
## Schooling 3.458e-01 3.029e-01
## Adult.Mortality:HIV.AIDS 7.991e-04 7.060e-05
## BMI:Diphtheria -2.023e-03 3.106e-04
## thinness..1.19.years:thinness.5.9.years 2.179e-02 3.744e-03
## percentage.expenditure:Income.composition.of.resources -4.737e-03 1.423e-03
## Income.composition.of.resources:Schooling 8.077e-01 1.590e-01
## BMI:Schooling -1.745e-02 1.961e-03
## Adult.Mortality:Schooling -1.130e-03 3.549e-04
## HIV.AIDS:Income.composition.of.resources 6.690e-01 3.884e-01
## Total.expenditure:thinness..1.19.years -5.954e-02 1.067e-02
## Alcohol:thinness.5.9.years -3.809e-02 1.133e-02
## StatusDeveloping:Schooling 1.314e+00 2.252e-01
## StatusDeveloping:Income.composition.of.resources -4.351e+01 7.002e+00
## StatusDeveloping:Adult.Mortality -2.593e-02 5.227e-03
## Measles:HIV.AIDS -9.133e-06 2.844e-06
## Alcohol:HIV.AIDS 6.044e-02 1.041e-02
## Alcohol:GDP -7.187e-06 2.761e-06
## Adult.Mortality:Diphtheria -7.825e-05 3.020e-05
## Adult.Mortality:Alcohol -9.920e-04 2.845e-04
## Polio:Schooling -8.506e-03 1.529e-03
## Diphtheria:Income.composition.of.resources 9.609e-02 2.021e-02
## StatusDeveloping:BMI -4.553e-02 1.627e-02
## Hepatitis.B:Total.expenditure 4.709e-03 1.224e-03
## Total.expenditure:HIV.AIDS 6.273e-02 1.128e-02
## Adult.Mortality:Total.expenditure -1.050e-03 3.417e-04
## percentage.expenditure:HIV.AIDS -2.861e-04 8.863e-05
## Alcohol:Total.expenditure -4.257e-02 1.159e-02
## percentage.expenditure:thinness.5.9.years 2.703e-04 6.965e-05
## StatusDeveloping:Alcohol 1.782e-01 8.684e-02
## StatusDeveloping:thinness..1.19.years 1.306e+00 3.729e-01
## Alcohol:Schooling 5.379e-02 1.882e-02
## StatusDeveloping:GDP 7.551e-05 2.258e-05
## percentage.expenditure:GDP 5.726e-09 1.703e-09
## HIV.AIDS:Schooling -4.268e-02 1.701e-02
## HIV.AIDS:thinness..1.19.years 6.924e-03 4.360e-03
## Measles:BMI -3.279e-06 1.198e-06
## StatusDeveloping:infant.deaths -3.739e-01 1.911e-01
## Measles:Polio 4.130e-06 1.188e-06
## infant.deaths:Total.expenditure -3.035e-03 1.246e-03
## Total.expenditure:Population 6.723e-09 1.651e-09
## Alcohol:Population -2.766e-09 9.832e-10
## Total.expenditure:Diphtheria -4.803e-03 1.821e-03
## Alcohol:Polio 2.977e-03 1.381e-03
## Hepatitis.B:thinness..1.19.years 4.924e-03 1.708e-03
## Alcohol:Income.composition.of.resources -8.987e-01 3.345e-01
## Diphtheria:thinness.5.9.years -2.931e-03 1.145e-03
## BMI:Polio 5.953e-04 2.561e-04
## infant.deaths:Hepatitis.B -1.085e-04 3.748e-05
## Total.expenditure:Income.composition.of.resources 4.263e-01 2.230e-01
## StatusDeveloping:Total.expenditure 2.143e-01 1.132e-01
## Hepatitis.B:Diphtheria 3.186e-04 1.375e-04
## Measles:Diphtheria -2.662e-06 1.089e-06
## infant.deaths:Diphtheria 1.288e-04 5.247e-05
## infant.deaths:Polio -2.228e-04 7.456e-05
## infant.deaths:Income.composition.of.resources 3.450e-02 1.216e-02
## Population:Income.composition.of.resources -8.581e-08 3.626e-08
## percentage.expenditure:Population 6.886e-12 2.943e-12
## Hepatitis.B:Population 1.390e-10 5.800e-11
## Polio:GDP -1.796e-06 7.294e-07
## Hepatitis.B:Polio -2.496e-04 1.451e-04
## Total.expenditure:GDP 4.335e-06 2.566e-06
## Adult.Mortality:percentage.expenditure 6.013e-06 2.333e-06
## Adult.Mortality:GDP -6.794e-07 3.142e-07
## StatusDeveloping:Polio -3.952e-02 2.525e-02
## Alcohol:BMI 2.623e-03 1.599e-03
## Hepatitis.B:thinness.5.9.years -2.394e-03 1.643e-03
## Population:thinness.5.9.years -1.358e-09 7.128e-10
## Population:thinness..1.19.years 1.192e-09 7.439e-10
## infant.deaths:thinness..1.19.years -1.855e-04 1.331e-04
## t value Pr(>|t|)
## (Intercept) 3.602 0.000325 ***
## StatusDeveloping 3.111 0.001893 **
## Adult.Mortality 5.188 2.38e-07 ***
## infant.deaths 1.976 0.048341 *
## Alcohol -0.557 0.577669
## percentage.expenditure 2.668 0.007701 **
## Hepatitis.B -2.892 0.003873 **
## Measles -0.055 0.955767
## BMI 10.966 < 2e-16 ***
## Polio 3.928 8.91e-05 ***
## Total.expenditure 0.319 0.749759
## Diphtheria 3.343 0.000846 ***
## HIV.AIDS -8.797 < 2e-16 ***
## GDP 2.531 0.011459 *
## Population 1.047 0.295194
## thinness..1.19.years -4.159 3.35e-05 ***
## thinness.5.9.years 2.465 0.013807 *
## Income.composition.of.resources 4.468 8.42e-06 ***
## Schooling 1.142 0.253712
## Adult.Mortality:HIV.AIDS 11.318 < 2e-16 ***
## BMI:Diphtheria -6.515 9.50e-11 ***
## thinness..1.19.years:thinness.5.9.years 5.821 6.94e-09 ***
## percentage.expenditure:Income.composition.of.resources -3.329 0.000889 ***
## Income.composition.of.resources:Schooling 5.081 4.15e-07 ***
## BMI:Schooling -8.897 < 2e-16 ***
## Adult.Mortality:Schooling -3.184 0.001477 **
## HIV.AIDS:Income.composition.of.resources 1.723 0.085147 .
## Total.expenditure:thinness..1.19.years -5.580 2.79e-08 ***
## Alcohol:thinness.5.9.years -3.362 0.000789 ***
## StatusDeveloping:Schooling 5.834 6.43e-09 ***
## StatusDeveloping:Income.composition.of.resources -6.214 6.44e-10 ***
## StatusDeveloping:Adult.Mortality -4.961 7.69e-07 ***
## Measles:HIV.AIDS -3.211 0.001345 **
## Alcohol:HIV.AIDS 5.804 7.69e-09 ***
## Alcohol:GDP -2.603 0.009306 **
## Adult.Mortality:Diphtheria -2.591 0.009646 **
## Adult.Mortality:Alcohol -3.487 0.000500 ***
## Polio:Schooling -5.563 3.06e-08 ***
## Diphtheria:Income.composition.of.resources 4.756 2.14e-06 ***
## StatusDeveloping:BMI -2.799 0.005190 **
## Hepatitis.B:Total.expenditure 3.846 0.000124 ***
## Total.expenditure:HIV.AIDS 5.563 3.06e-08 ***
## Adult.Mortality:Total.expenditure -3.073 0.002151 **
## percentage.expenditure:HIV.AIDS -3.228 0.001272 **
## Alcohol:Total.expenditure -3.673 0.000247 ***
## percentage.expenditure:thinness.5.9.years 3.881 0.000108 ***
## StatusDeveloping:Alcohol 2.052 0.040336 *
## StatusDeveloping:thinness..1.19.years 3.502 0.000474 ***
## Alcohol:Schooling 2.858 0.004310 **
## StatusDeveloping:GDP 3.345 0.000841 ***
## percentage.expenditure:GDP 3.362 0.000792 ***
## HIV.AIDS:Schooling -2.509 0.012206 *
## HIV.AIDS:thinness..1.19.years 1.588 0.112485
## Measles:BMI -2.737 0.006261 **
## StatusDeveloping:infant.deaths -1.956 0.050566 .
## Measles:Polio 3.477 0.000519 ***
## infant.deaths:Total.expenditure -2.436 0.014931 *
## Total.expenditure:Population 4.073 4.85e-05 ***
## Alcohol:Population -2.814 0.004950 **
## Total.expenditure:Diphtheria -2.637 0.008432 **
## Alcohol:Polio 2.156 0.031216 *
## Hepatitis.B:thinness..1.19.years 2.883 0.003991 **
## Alcohol:Income.composition.of.resources -2.687 0.007275 **
## Diphtheria:thinness.5.9.years -2.559 0.010591 *
## BMI:Polio 2.324 0.020241 *
## infant.deaths:Hepatitis.B -2.896 0.003827 **
## Total.expenditure:Income.composition.of.resources 1.911 0.056119 .
## StatusDeveloping:Total.expenditure 1.894 0.058450 .
## Hepatitis.B:Diphtheria 2.316 0.020658 *
## Measles:Diphtheria -2.444 0.014627 *
## infant.deaths:Diphtheria 2.456 0.014164 *
## infant.deaths:Polio -2.989 0.002842 **
## infant.deaths:Income.composition.of.resources 2.838 0.004595 **
## Population:Income.composition.of.resources -2.366 0.018072 *
## percentage.expenditure:Population 2.340 0.019408 *
## Hepatitis.B:Population 2.397 0.016625 *
## Polio:GDP -2.462 0.013910 *
## Hepatitis.B:Polio -1.719 0.085709 .
## Total.expenditure:GDP 1.689 0.091356 .
## Adult.Mortality:percentage.expenditure 2.577 0.010052 *
## Adult.Mortality:GDP -2.162 0.030754 *
## StatusDeveloping:Polio -1.565 0.117756
## Alcohol:BMI 1.641 0.101080
## Hepatitis.B:thinness.5.9.years -1.457 0.145220
## Population:thinness.5.9.years -1.906 0.056829 .
## Population:thinness..1.19.years 1.603 0.109205
## infant.deaths:thinness..1.19.years -1.394 0.163568
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.139 on 1754 degrees of freedom
## Multiple R-squared: 0.9031, Adjusted R-squared: 0.8983
## F-statistic: 190 on 86 and 1754 DF, p-value: < 2.2e-16
test_results$bothint <- predict(bothint, testing)
postResample(pred = test_results$bothint, obs = test_results$Life.expectancy)
## RMSE Rsquared MAE
## 3.486865 0.871720 2.631852
This is the best model so far, both for predicting and for accounting for the variability of the data.
I will define the both-directions stepwise model with interactions as a formula, to reuse later in caret for a fairer comparison.
Model1 <- Life.expectancy ~ Status + Adult.Mortality + infant.deaths +
Alcohol + percentage.expenditure + Hepatitis.B + Measles +
BMI + Polio + Total.expenditure + Diphtheria + HIV.AIDS +
GDP + Population + thinness..1.19.years + thinness.5.9.years +
Income.composition.of.resources + Schooling + Adult.Mortality:HIV.AIDS +
BMI:Diphtheria + thinness..1.19.years:thinness.5.9.years +
percentage.expenditure:Income.composition.of.resources +
Adult.Mortality:Schooling + BMI:Schooling + Income.composition.of.resources:Schooling +
HIV.AIDS:Income.composition.of.resources + Alcohol:thinness..1.19.years +
Status:Schooling + Status:Income.composition.of.resources +
Total.expenditure:thinness..1.19.years + Status:Adult.Mortality +
Measles:HIV.AIDS + Polio:Schooling + Diphtheria:Income.composition.of.resources + Alcohol:HIV.AIDS + Status:GDP + HIV.AIDS:GDP + percentage.expenditure:thinness..1.19.years + Status:BMI + Total.expenditure:HIV.AIDS + HIV.AIDS:Schooling +
Adult.Mortality:Hepatitis.B + Measles:Total.expenditure +
Status:Alcohol + infant.deaths:BMI + Alcohol:Schooling +
Status:thinness..1.19.years + Status:Total.expenditure +
Hepatitis.B:Measles + infant.deaths:Total.expenditure + Total.expenditure:Population + Alcohol:Population + Status:infant.deaths + Adult.Mortality:Alcohol +
percentage.expenditure:GDP + BMI:GDP + Alcohol:percentage.expenditure +
infant.deaths:thinness.5.9.years + Hepatitis.B:Diphtheria +
Polio:Income.composition.of.resources + BMI:Polio + Alcohol:Polio +
percentage.expenditure:Population + Hepatitis.B:HIV.AIDS +
Diphtheria:HIV.AIDS + Alcohol:Total.expenditure + Total.expenditure:Income.composition.of.resources +
Alcohol:Income.composition.of.resources + infant.deaths:Income.composition.of.resources +
Measles:Income.composition.of.resources + Polio:Diphtheria +
infant.deaths:GDP + Status:Population + Hepatitis.B:thinness..1.19.years +
Hepatitis.B:thinness.5.9.years + Diphtheria:thinness..1.19.years +
infant.deaths:Polio + Diphtheria:GDP + Status:Diphtheria +
Measles:Polio
Let's plot the observed vs. predicted values:
qplot(test_results$bothint, test_results$Life.expectancy) +
labs(title="Linear Regression Observed VS Predicted", x="Predicted", y="Observed") +
geom_abline(intercept = 0, slope = 1, colour = "blue") +
theme_bw()
Now, let's see which variables and which interactions are the most important according to this procedure.
modboth <- train(Model1, data = training,
method = "lm",
preProc=c('scale', 'center'),
trControl = control)
importance <- varImp(modboth, scale=FALSE)
plot(importance,top=15)
As we can see, BMI now seems to be the most important variable, and the most important interactions are "Adult.Mortality:HIV.AIDS" and "BMI:Schooling". So far, this is our best model both for predicting life expectancy and for accounting for the variability in the data. Nevertheless, it is huge and practically impossible to interpret.
Let's also try the same model, but now as a generalized linear model with the Gamma family.
modboth2 <- train(Model1, data = training,
method = "glm", family="Gamma",
preProc=c('scale', 'center'),
trControl = control)
test_results$modboth2 <- predict(modboth2, testing)
postResample(pred = test_results$modboth2, obs = test_results$Life.expectancy)
## RMSE Rsquared MAE
## 3.2187054 0.8892586 2.4491829
importance <- varImp(modboth2, scale=FALSE)
plot(importance,top=15)
The model has not improved. Although it is fairly good compared with the rest, it does not beat the model with the same variables and interactions that assumes normality.
Let's try other options. First, let's do forward selection with caret.
for_tune <- train(Life.expectancy ~ ., data = training,
method = "leapForward",
preProc=c('scale', 'center'),
tuneGrid = expand.grid(nvmax = 4:19),
trControl = control)
for_tune
## Linear Regression with Forward Selection
##
## 1841 samples
## 18 predictor
##
## Pre-processing: scaled (18), centered (18)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1658, 1657, 1657, 1656, 1657, 1656, ...
## Resampling results across tuning parameters:
##
## nvmax RMSE Rsquared MAE
## 4 4.317675 0.8095993 3.159062
## 5 4.268483 0.8142712 3.148148
## 6 4.175195 0.8219003 3.092526
## 7 4.131657 0.8256070 3.069960
## 8 4.121136 0.8265466 3.060980
## 9 4.099382 0.8285044 3.056625
## 10 4.074994 0.8305700 3.051912
## 11 4.057618 0.8319578 3.041665
## 12 4.067163 0.8311810 3.050688
## 13 4.071590 0.8308270 3.055188
## 14 4.067128 0.8312313 3.053511
## 15 4.065701 0.8313714 3.054043
## 16 4.065881 0.8313733 3.055358
## 17 4.065712 0.8313851 3.055441
## 18 4.065680 0.8313897 3.055641
## 19 4.065680 0.8313897 3.055641
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was nvmax = 11.
plot(for_tune)
Now, let's inspect which variables are selected and how well this model performs:
coef(for_tune$finalModel, for_tune$bestTune$nvmax)
## (Intercept) StatusDeveloping
## 68.6932102 -0.5644930
## Adult.Mortality Alcohol
## -1.9452031 -0.7045744
## BMI Polio
## 0.9964461 0.4831664
## Diphtheria HIV.AIDS
## 0.6505967 -2.7200491
## GDP thinness..1.19.years
## 0.6455364 -0.3768232
## Income.composition.of.resources Schooling
## 2.1330433 2.8661297
test_results$frw <- predict(for_tune, testing)
postResample(pred = test_results$frw, obs = test_results$Life.expectancy)
## RMSE Rsquared MAE
## 3.924330 0.834699 2.982841
As can be seen, this is not a great model by R-squared, MAE, or RMSE. We can plot the observed vs. predicted values as follows:
qplot(test_results$frw, test_results$Life.expectancy) +
labs(title="Forward Regression Observed VS Predicted", x="Predicted", y="Observed") +
geom_abline(intercept = 0, slope = 1, colour = "blue") +
theme_bw()
Let's now check backward selection:
back_tune <- train(Life.expectancy ~ ., data = training,
method = "leapBackward",
preProc=c('scale', 'center'),
tuneGrid = expand.grid(nvmax = 4:19),
trControl = control)
back_tune
## Linear Regression with Backwards Selection
##
## 1841 samples
## 18 predictor
##
## Pre-processing: scaled (18), centered (18)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1657, 1657, 1658, 1658, 1657, 1656, ...
## Resampling results across tuning parameters:
##
## nvmax RMSE Rsquared MAE
## 4 4.319974 0.8087610 3.158816
## 5 4.249381 0.8149485 3.127064
## 6 4.155671 0.8232294 3.073276
## 7 4.107750 0.8271224 3.050102
## 8 4.096536 0.8280898 3.042170
## 9 4.075513 0.8297994 3.048187
## 10 4.068795 0.8301657 3.043452
## 11 4.056296 0.8311632 3.039057
## 12 4.077251 0.8293977 3.059022
## 13 4.077811 0.8293867 3.060592
## 14 4.073521 0.8297927 3.058771
## 15 4.072141 0.8299535 3.058664
## 16 4.074264 0.8297790 3.058976
## 17 4.074799 0.8297339 3.059962
## 18 4.074977 0.8297198 3.060346
## 19 4.074977 0.8297198 3.060346
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was nvmax = 11.
plot(back_tune)
Let's see which variables have been selected and how the model performs:
coef(back_tune$finalModel, back_tune$bestTune$nvmax)
## (Intercept) StatusDeveloping
## 68.6932102 -0.5644930
## Adult.Mortality Alcohol
## -1.9452031 -0.7045744
## BMI Polio
## 0.9964461 0.4831664
## Diphtheria HIV.AIDS
## 0.6505967 -2.7200491
## GDP thinness..1.19.years
## 0.6455364 -0.3768232
## Income.composition.of.resources Schooling
## 2.1330433 2.8661297
test_results$bw <- predict(back_tune, testing)
postResample(pred = test_results$bw, obs = test_results$Life.expectancy)
## RMSE Rsquared MAE
## 3.924330 0.834699 2.982841
Here we get pretty much the same result as with forward selection; in fact, the selected variables and coefficients are identical. The observed vs. predicted plot shows the similarity with the one before:
qplot(test_results$bw, test_results$Life.expectancy) +
labs(title="Backward Regression Observed VS Predicted", x="Predicted", y="Observed") +
geom_abline(intercept = 0, slope = 1, colour = "blue") +
theme_bw()
Let's now check stepwise selection:
step_tune <- train(Life.expectancy ~ ., data = training,
method = "leapSeq",
preProc=c('scale', 'center'),
tuneGrid = expand.grid(nvmax = 4:19),
trControl = control)
plot(step_tune)
Let's see which variables are selected and how the model performs:
coef(step_tune$finalModel, step_tune$bestTune$nvmax)
## (Intercept) StatusDeveloping
## 68.69321021 -0.52859026
## Adult.Mortality infant.deaths
## -1.95757329 0.03839523
## Alcohol percentage.expenditure
## -0.68906856 0.23459228
## Hepatitis.B Measles
## -0.19536435 -0.17781085
## BMI Polio
## 0.98664046 0.49293815
## Total.expenditure Diphtheria
## 0.15088380 0.72924319
## HIV.AIDS GDP
## -2.73748007 0.36769269
## Population thinness..1.19.years
## 0.01786592 -0.46828747
## thinness.5.9.years Income.composition.of.resources
## 0.13295805 2.14321792
## Schooling
## 2.85474312
test_results$seq <- predict(step_tune, testing)
postResample(pred = test_results$seq, obs = test_results$Life.expectancy)
## RMSE Rsquared MAE
## 3.9043953 0.8364015 2.9682015
As seen here, we get almost the same result as with backward and forward selection. We can see it, again, in the observed vs. predicted plot on the testing set:
qplot(test_results$seq, test_results$Life.expectancy) +
labs(title="Stepwise Regression Observed VS Predicted", x="Predicted", y="Observed") +
geom_abline(intercept = 0, slope = 1, colour = "blue") +
theme_bw()
Let's now try our first regularization method: ridge regression. Here, the squared magnitude of the coefficients is added as a penalty term to the loss function, which helps prevent overfitting. This regression has one parameter to tune: lambda. If it is zero, the fit is identical to multiple linear regression; if it is large, it leads to under-fitting.
The drawback of ridge regression is that it is not useful for obtaining a parsimonious model: it shrinks coefficients toward zero but not exactly to zero, so we cannot use it for feature selection.
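As a side illustration (a Python sketch on synthetic data, not this report's dataset), the ridge objective and its closed-form solution make the shrinkage behavior explicit:

```python
import numpy as np

# Ridge minimizes ||y - X b||^2 + lambda * ||b||^2, with closed form
# b = (X'X + lambda I)^{-1} X'y. As lambda -> 0 this is ordinary least
# squares; a large lambda shrinks every coefficient toward, but not
# exactly to, zero -- which is why ridge cannot do feature selection.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

def ridge_coefs(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

b_ols = ridge_coefs(X, y, 0.0)     # lambda = 0: plain least squares
b_big = ridge_coefs(X, y, 1000.0)  # large lambda: heavy shrinkage
print(np.abs(b_big) < np.abs(b_ols))  # every coefficient shrinks
print(np.all(b_big != 0))             # but none is exactly zero
```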
Let's run this regression and see how it performs in general.
ridge_grid <- expand.grid(lambda = seq(0, .1, length = 20))
ridge_tune <- train(Life.expectancy ~ ., data = training,
method='ridge',
preProc=c('scale','center'),
tuneGrid = ridge_grid,
trControl=control)
plot(ridge_tune)
ridge_tune$bestTune
## lambda
## 2 0.005263158
test_results$ridge <- predict(ridge_tune, testing)
postResample(pred = test_results$ridge, obs = test_results$Life.expectancy)
## RMSE Rsquared MAE
## 3.901995 0.836567 2.964577
As we can see, the selected lambda is quite small, which may explain why we get performance so similar to multiple linear regression.
The lasso (least absolute shrinkage and selection operator) also performs regularization. This time, the absolute value of the coefficients is added as the penalty term to the loss function. Because some coefficients can become exactly zero, we can perform feature selection with it.
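A minimal sketch of why the L1 penalty can zero out coefficients: for a single standardized predictor, the coordinate-wise lasso update is the soft-thresholding operator (Python, illustrative only):

```python
# Soft-thresholding: any OLS coefficient whose magnitude is below the
# threshold lambda is set exactly to zero (dropped from the model);
# larger coefficients are shrunk toward zero by lambda.
def soft_threshold(b_ols, lam):
    if b_ols > lam:
        return b_ols - lam
    if b_ols < -lam:
        return b_ols + lam
    return 0.0

print(soft_threshold(2.5, 1.0))   # 1.5  (shrunk but kept)
print(soft_threshold(0.4, 1.0))   # 0.0  (dropped from the model)
print(soft_threshold(-3.0, 1.0))  # -2.0
```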
Here the tuning parameter is the fraction of the full L1 norm (caret's fraction), which plays the role of lambda. Let's see this model's performance:
lasso_grid <- expand.grid(fraction = seq(.01, 1, length = 20))
lasso_tune <- train(Life.expectancy ~ . , data = training,
method='lasso',
preProc=c('scale','center'),
tuneGrid = lasso_grid,
trControl=control)
plot(lasso_tune)
lasso_tune$bestTune
## fraction
## 19 0.9478947
test_results$lasso <- predict(lasso_tune, testing)
postResample(pred = test_results$lasso, obs = test_results$Life.expectancy)
## RMSE Rsquared MAE
## 3.9117949 0.8358292 2.9688481
Again, we get results very similar to ridge regression and multiple linear regression. Let's check the variable importance to interpret what is happening.
importance <- varImp(lasso_tune, scale=FALSE)
plot(importance)
Elastic net regression combines the lasso and ridge penalties linearly.
Let's see if it performs better, although it will most likely give a result very similar to the other two.
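For reference, a sketch of the glmnet-style penalty that the tuning below explores (Python, illustrative; alpha mixes the two penalties and lambda scales the overall strength):

```python
# Elastic net penalty as parameterized by glmnet:
# lambda * ( alpha * ||b||_1 + (1 - alpha)/2 * ||b||_2^2 )
# alpha = 1 recovers the lasso, alpha = 0 recovers ridge.
def elastic_net_penalty(coefs, lam, alpha):
    l1 = sum(abs(b) for b in coefs)
    l2 = sum(b * b for b in coefs)
    return lam * (alpha * l1 + (1 - alpha) * 0.5 * l2)

b = [2.0, -1.0, 0.0]
print(elastic_net_penalty(b, 0.1, 0.0))  # pure ridge term
print(elastic_net_penalty(b, 0.1, 1.0))  # pure lasso term
```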
modelLookup('glmnet')
## model parameter label forReg forClass probModel
## 1 glmnet alpha Mixing Percentage TRUE TRUE TRUE
## 2 glmnet lambda Regularization Parameter TRUE TRUE TRUE
elastic_grid = expand.grid(alpha = seq(0, .2, 0.01), lambda = seq(0, .1, 0.01))
glmnet_tune <- train(Life.expectancy ~. , data = training,
method='glmnet',
preProc=c('scale','center'),
tuneGrid = elastic_grid,
trControl=control)
plot(glmnet_tune)
glmnet_tune$bestTune
## alpha lambda
## 154 0.13 0.1
test_results$glmnet <- predict(glmnet_tune, testing)
postResample(pred = test_results$glmnet, obs = test_results$Life.expectancy)
## RMSE Rsquared MAE
## 3.9036419 0.8365401 2.9660770
We get essentially the same RMSE, R-squared, and MAE as with ridge and lasso.
We can check the value of each coefficient:
coef(glmnet_tune$finalModel, glmnet_tune$bestTune$lambda)
## 19 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 68.693210212
## StatusDeveloping -0.507931034
## Adult.Mortality -1.960637780
## infant.deaths 0.002951869
## Alcohol -0.613526866
## percentage.expenditure 0.230630014
## Hepatitis.B -0.167257030
## Measles -0.153379641
## BMI 0.983151867
## Polio 0.488482747
## Total.expenditure 0.138442019
## Diphtheria 0.713065025
## HIV.AIDS -2.713100223
## GDP 0.373502218
## Population 0.015669246
## thinness..1.19.years -0.329271778
## thinness.5.9.years .
## Income.composition.of.resources 2.152015642
## Schooling 2.788738105
Let's see which variables are the most important according to this method:
importance <- varImp(glmnet_tune, scale=FALSE)
plot(importance)
coef(glmnet_tune$finalModel, glmnet_tune$bestTune$lambda)[,1]
## (Intercept) StatusDeveloping
## 68.693210212 -0.507931034
## Adult.Mortality infant.deaths
## -1.960637780 0.002951869
## Alcohol percentage.expenditure
## -0.613526866 0.230630014
## Hepatitis.B Measles
## -0.167257030 -0.153379641
## BMI Polio
## 0.983151867 0.488482747
## Total.expenditure Diphtheria
## 0.138442019 0.713065025
## HIV.AIDS GDP
## -2.713100223 0.373502218
## Population thinness..1.19.years
## 0.015669246 -0.329271778
## thinness.5.9.years Income.composition.of.resources
## 0.000000000 2.152015642
## Schooling
## 2.788738105
Let's try another kind of regression now: principal component regression (PCR). Let's see how it performs:
pcr_tune <- train(Life.expectancy ~ . , data = training,
method='pcr',
preProc=c('scale','center'),
tuneGrid = expand.grid(ncomp = 2:8),
trControl=control)
plot(pcr_tune)
test_results$pcr <- predict(pcr_tune, testing)
postResample(pred = test_results$pcr, obs = test_results$Life.expectancy)
## RMSE Rsquared MAE
## 3.9978858 0.8284269 3.0785942
Here we get a worse result than before: R-squared drops to about 0.828 and MAE rises to about 3.08, so this model is worse both at accounting for the variability in the data and at predicting.
Now we will do something similar to PCR: partial least squares (PLS) finds a linear regression by projecting the predictors and the response onto a new space. Let's see what happens when we run it:
pls_tune <- train(Life.expectancy ~ . , data = training,
method='pls',
preProc=c('scale','center'),
tuneGrid = expand.grid(ncomp = 2:8),
trControl=control)
plot(pls_tune)
test_results$pls <- predict(pls_tune, testing)
postResample(pred = test_results$pls, obs = test_results$Life.expectancy)
## RMSE Rsquared MAE
## 3.9059808 0.8362743 2.9680634
As we can see, the results are not bad: an R-squared of about 0.836, a MAE of about 2.97, and a RMSE of about 3.91, essentially the same as the selection and shrinkage methods.
Now we can use different approaches to identify the most important variables in our dataset. First, let's do Recursive Feature Elimination (RFE).
Here we can pass different functions to select the underlying algorithm; I will use random forests and bagged trees.
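Before running it, here is a language-agnostic sketch of the loop that RFE performs, on synthetic data and with a toy importance measure (absolute correlation with the target) standing in for random-forest importance:

```python
import numpy as np

# RFE: fit on all features, rank them by importance, drop the weakest,
# refit, and repeat; the cross-validated error then picks the best
# subset size. Here importance is just |corr(x_j, y)| for simplicity.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=200)  # 3 noise features

features = list(range(X.shape[1]))
ranking = []                      # elimination order, weakest first
while len(features) > 1:
    scores = {j: abs(np.corrcoef(X[:, j], y)[0, 1]) for j in features}
    weakest = min(scores, key=scores.get)
    ranking.append(weakest)
    features.remove(weakest)
ranking.append(features[0])       # the last survivor is the strongest
print(ranking[-2:])               # the two informative features survive longest
```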
First with Random Forest:
modelRFE <- rfe(Life.expectancy ~ ., data=training, sizes=c(1:19), rfeControl=rfeControl(functions=rfFuncs, method="cv", number=5))
print(modelRFE)
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (5 fold)
##
## Resampling performance over subset size:
##
## Variables RMSE Rsquared MAE RMSESD RsquaredSD MAESD Selected
## 1 5.801 0.6525 4.492 0.3881 0.046921 0.31979
## 2 3.308 0.8884 2.486 0.1542 0.012040 0.12929
## 3 2.351 0.9447 1.666 0.1306 0.006062 0.07039
## 4 2.335 0.9459 1.627 0.1588 0.007789 0.09882
## 5 2.286 0.9485 1.547 0.1719 0.008049 0.10709
## 6 1.988 0.9596 1.277 0.1169 0.005045 0.05513
## 7 1.990 0.9597 1.268 0.1085 0.004792 0.05535
## 8 1.982 0.9602 1.257 0.1647 0.006735 0.06802
## 9 1.922 0.9623 1.206 0.1564 0.006264 0.07624 *
## 10 1.927 0.9622 1.210 0.1669 0.006691 0.06407
## 11 1.941 0.9617 1.225 0.1747 0.007079 0.07020
## 12 1.937 0.9617 1.223 0.1654 0.006555 0.06697
## 13 1.949 0.9614 1.242 0.1662 0.006682 0.06817
## 14 1.973 0.9604 1.262 0.1709 0.007002 0.07279
## 15 1.975 0.9603 1.267 0.1567 0.006374 0.06268
## 16 1.991 0.9597 1.279 0.1634 0.006742 0.06718
## 17 2.004 0.9593 1.295 0.1499 0.006322 0.05533
## 18 1.981 0.9600 1.280 0.1549 0.006369 0.06539
##
## The top 5 variables (out of 9):
## HIV.AIDS, Income.composition.of.resources, Adult.Mortality, Total.expenditure, Alcohol
predictors(modelRFE)
## [1] "HIV.AIDS" "Income.composition.of.resources"
## [3] "Adult.Mortality" "Total.expenditure"
## [5] "Alcohol" "thinness.5.9.years"
## [7] "infant.deaths" "Schooling"
## [9] "BMI"
ggplot(data = modelRFE, metric = "MAE") + theme_bw()
Here the result is interesting. According to this approach, the top 5 variables are HIV.AIDS, Income.composition.of.resources, Adult.Mortality, Total.expenditure, and Alcohol. Let's see what happens when we do the same with the bagging functions.
Second, with treebagFuncs:
modelRFE <- rfe(Life.expectancy ~ ., data=training, sizes=c(1:19), rfeControl=rfeControl(functions=treebagFuncs, method="cv", number=5))
print(modelRFE)
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (5 fold)
##
## Resampling performance over subset size:
##
## Variables RMSE Rsquared MAE RMSESD RsquaredSD MAESD Selected
## 1 4.693 0.7725 3.052 0.5602 0.050012 0.2077
## 2 3.394 0.8827 2.494 0.2930 0.016414 0.1912
## 3 3.355 0.8851 2.458 0.2323 0.011891 0.1578
## 4 3.346 0.8859 2.441 0.2340 0.012106 0.1975
## 5 3.360 0.8850 2.444 0.2861 0.016235 0.2053
## 6 3.321 0.8881 2.421 0.2034 0.010617 0.1694
## 7 3.337 0.8866 2.431 0.2584 0.014468 0.1936
## 8 3.324 0.8876 2.419 0.2482 0.013797 0.1935
## 9 3.331 0.8870 2.427 0.2497 0.013510 0.1921
## 10 3.337 0.8870 2.423 0.2064 0.009882 0.1671
## 11 3.343 0.8860 2.437 0.2385 0.012892 0.1552
## 12 3.299 0.8893 2.404 0.2475 0.013014 0.1731
## 13 3.311 0.8884 2.408 0.2083 0.010718 0.1600
## 14 3.254 0.8925 2.368 0.2561 0.013633 0.1671 *
## 15 3.320 0.8879 2.405 0.2379 0.013334 0.1844
## 16 3.307 0.8887 2.408 0.2926 0.017124 0.2173
## 17 3.317 0.8881 2.419 0.2422 0.013225 0.1919
## 18 3.310 0.8885 2.411 0.1929 0.009299 0.1398
##
## The top 5 variables (out of 14):
## Adult.Mortality, Income.composition.of.resources, HIV.AIDS, Schooling, BMI
predictors(modelRFE)
## [1] "Adult.Mortality" "Income.composition.of.resources"
## [3] "HIV.AIDS" "Schooling"
## [5] "BMI" "thinness.5.9.years"
## [7] "infant.deaths" "thinness..1.19.years"
## [9] "Alcohol" "GDP"
## [11] "Total.expenditure" "Measles"
## [13] "StatusDeveloping" "Diphtheria"
ggplot(data = modelRFE, metric = "MAE") + theme_bw()
Now the result is slightly different: the most important variable is Adult.Mortality, with Income.composition.of.resources, HIV.AIDS, Schooling, and BMI completing the top 5. The set of variables is similar, but the top variable has changed.
We can also use the random forest algorithm to measure the importance of each variable in predicting the outcome. Let's see:
fitRF <- train(Life.expectancy ~ .,data=training, method='rf', preProcess="scale", trControl=control)
importance <- varImp(fitRF, scale=FALSE)
plot(importance)
Now the important variables are the same, but in a different order: first we have Income.composition.of.resources (the HDI), followed by HIV.AIDS, Adult.Mortality, and Schooling. It is particularly interesting that Schooling is so important for predicting life expectancy in a country, since its relationship with how long someone lives is not straightforward. It is likely that more education brings more opportunities for a better life and, in turn, more years lived.
Let's see what FOCI has to say about variable importance.
y <- training$Life.expectancy
x <- training[,3:19]
foci(y,x)$selectedVar
## index names
## 1: 1 Adult.Mortality
## 2: 16 Income.composition.of.resources
## 3: 14 thinness..1.19.years
## 4: 2 infant.deaths
## 5: 11 HIV.AIDS
## 6: 15 thinness.5.9.years
Through FOCI we again get mostly the same important variables, although now thinness..1.19.years appears important, as well as infant.deaths and thinness.5.9.years. Also, HIV.AIDS is no longer the most important variable, nor the second one.
Boruta is another algorithm that helps us check which variables are the most important:
boruta <- Boruta(Life.expectancy ~ ., data=training, doTrace=0)
plot(boruta, cex.axis=.7, las=2, xlab="", main="Variable Importance")
According to Boruta, the most important variable is HIV.AIDS, followed by Adult.Mortality, Income.composition.of.resources (HDI), and Alcohol.
Comparing Models
mods <- resamples(list(lm =lm1, glm = lm2,glm2 = modboth2, AICstep =modboth, Elasticnet = glmnet_tune,Lasso = lasso_tune, Ridge=ridge_tune,PCR=pcr_tune,PLS=pls_tune))
summary(mods)
##
## Call:
## summary.resamples(object = mods)
##
## Models: lm, glm, glm2, AICstep, Elasticnet, Lasso, Ridge, PCR, PLS
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lm 2.904209 2.977280 3.059957 3.052689 3.130283 3.186402 0
## glm 2.541723 2.973178 3.042020 3.010130 3.120245 3.266869 0
## glm2 2.104592 2.308031 2.435948 2.418056 2.554022 2.623127 0
## AICstep 2.057897 2.244927 2.359047 2.403024 2.552085 2.867144 0
## Elasticnet 2.719857 2.872102 3.083121 3.046356 3.144336 3.516041 0
## Lasso 2.682250 2.972440 3.056770 3.043569 3.183607 3.365308 0
## Ridge 2.733071 2.981166 3.007773 3.056833 3.145932 3.570312 0
## PCR 2.903939 3.038067 3.182672 3.198087 3.343363 3.576914 0
## PLS 2.840729 2.880723 3.082646 3.055828 3.201801 3.311333 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lm 3.700169 3.966548 4.101958 4.070353 4.172824 4.333241 0
## glm 3.197790 3.887228 4.096885 3.985362 4.188647 4.387456 0
## glm2 2.703738 3.149077 3.281939 3.308094 3.582901 3.662857 0
## AICstep 2.740023 3.020834 3.194452 3.278964 3.518878 3.910933 0
## Elasticnet 3.680510 3.852835 4.077948 4.062508 4.167902 4.633414 0
## Lasso 3.401217 3.903813 4.093806 4.070539 4.356432 4.472962 0
## Ridge 3.583234 3.951096 4.073430 4.069262 4.165969 4.752926 0
## PCR 3.730052 4.109398 4.201814 4.250417 4.420046 4.808915 0
## PLS 3.544669 3.830500 4.132513 4.063984 4.269903 4.545798 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lm 0.7862637 0.8241307 0.8376867 0.8302307 0.8412938 0.8527473 0
## glm 0.8108844 0.8246107 0.8311044 0.8378114 0.8451480 0.8828211 0
## glm2 0.8614454 0.8734216 0.8815586 0.8868794 0.9024361 0.9229028 0
## AICstep 0.8598332 0.8810161 0.8922125 0.8897229 0.9002519 0.9171064 0
## Elasticnet 0.7868369 0.8063373 0.8293767 0.8298865 0.8525653 0.8700561 0
## Lasso 0.8055561 0.8155135 0.8282103 0.8312472 0.8474812 0.8637735 0
## Ridge 0.7929892 0.8140624 0.8359233 0.8322962 0.8448253 0.8686858 0
## PCR 0.7651896 0.8019927 0.8200900 0.8147053 0.8299610 0.8454111 0
## PLS 0.7855012 0.8144389 0.8329310 0.8311162 0.8486403 0.8640585 0
Comparing the models obtained so far, we can say that:
Regarding R-squared: on average, the best models are AICstep (the linear model with many variables and interactions, mean R-squared of about 0.890) and glm2 (the same specification fit as a generalized linear model with the Gamma family, about 0.887). Each explains close to 89% of the variability across the resampling folds. If we want a simpler model, the generalized linear model without interactions is decent too (about 0.84).
Regarding MAE: on average, the best model is again AICstep, with a mean absolute error of about 2.40, closely followed by glm2 at about 2.42. Checking the minimums across folds, the lowest MAE (about 2.06) also belongs to the model with many variables and interactions. Looking for a more parsimonious option, the generalized linear model with all the variables but no interactions reaches a minimum of about 2.54.
A good way to look for a better model is to combine some of the ones above and check whether averaging their predictions improves the result. Let's try!
test_results$comb = (test_results$modboth + test_results$modboth2 + test_results$lm2 +test_results$forwint)/4
postResample(pred = test_results$comb, obs = test_results$Life.expectancy)
## RMSE Rsquared MAE
## 3.2752265 0.8852543 2.4989780
This is a fairly good model: the mean absolute error is among the smallest, and it comes with a high R-squared as well.
Now it's time to calculate prediction intervals (which are different from the usual confidence intervals).
I will calculate these prediction intervals with my best linear model first, the one with many variables and interactions.
modelo <- lm(Model1,training)
predictions <- predict.lm(modelo, newdata=testing, interval="prediction", level=0.90)
head(predictions)
## fit lwr upr
## 3 62.92031 57.60022 68.24040
## 8 59.78423 54.44552 65.12293
## 12 55.82064 50.17591 61.46537
## 17 76.08611 70.84194 81.33028
## 21 75.02596 69.78311 80.26882
## 23 73.91584 68.66886 79.16281
predictions=as.data.frame(predictions)
predictions$real = (test_results$Life.expectancy)
predictions = predictions %>% mutate(out=factor(if_else(real<lwr | real>upr,1,0)))
mean(predictions$out==1)
## [1] 0.0940919
We can see that around 10% of the real observations lie outside the prediction interval, consistent with the 90% level. Let's check it in the plot:
ggplot(predictions, aes(x=fit, y=real))+
geom_point(aes(color=out)) + theme(legend.position="none") +
xlim(10, 100) + ylim(10, 100)+
geom_ribbon(data=predictions,aes(ymin=lwr,ymax=upr),alpha=0.3) +
labs(title = "Prediction intervals", x = "prediction",y="real life expectancy")
As we can see, the prediction intervals are pretty good: slightly less than 10% of the real values fall outside the interval, and most of them are inside.
Let's see what happens when we build prediction intervals for the model we just created by combining different ones.
yhat = (test_results$comb)
head(yhat)
## [1] 62.62510 59.90074 56.44643 76.29174 75.08380 73.92900
hist(yhat, col="lightblue")
As we can see, the distribution of the predictions looks roughly normal, although it is a bit skewed. Let's check the prediction errors:
y = (test_results$Life.expectancy)
error = y-yhat
hist(error, col="lightblue")
The error distribution is a bit more asymmetric than the prediction distribution, but it also looks better centered (the median is close to zero). Now I am going to split the testing set into two parts, because I cannot use the same observations both to estimate the noise and to evaluate the intervals. With one part I will estimate the noise level, and with the other I will build and check the intervals. As I have 459 observations in the testing set, I will use the first 180 for the noise and the rest for the intervals.
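The split-and-estimate idea can be sketched generically in Python (with hypothetical residuals and predictions, not the report's data):

```python
import statistics
from statistics import NormalDist

# One part of the test set estimates the noise level of the model's
# errors; the other part gets intervals built from that estimate, so
# the same observations are never used for both purposes.
errors = [0.8, -1.2, 0.3, 1.5, -0.7, 0.1, -0.4, 1.0, -1.1, 0.6]  # held-out residuals
sd_noise = statistics.stdev(errors)

z90 = NormalDist().inv_cdf(0.95)   # ~1.645: two-sided 90% normal quantile
preds = [70.3, 65.8, 72.1]         # predictions on the *other* split
intervals = [(p - z90 * sd_noise, p + z90 * sd_noise) for p in preds]
for lo, hi in intervals:
    # every interval has the same width, 2 * z90 * sd_noise
    assert abs((hi - lo) - 2 * z90 * sd_noise) < 1e-9
print(intervals[0])
```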
noise = error[1:180]
sd.noise = sd(noise)
hist(noise)
Now I can use the rest of the observations in the testing set to build the intervals. As the histogram above shows, the noise looks very close to a normal distribution. We can check this with a Shapiro-Wilk normality test.
x.test <- shapiro.test(noise)
x.test
##
## Shapiro-Wilk normality test
##
## data: noise
## W = 0.98868, p-value = 0.1613
As we can see, the p-value is higher than 0.05, so we do not reject the null hypothesis of normality at the 95% confidence level. I can now proceed to build the intervals. For a normal distribution, 90% of the data lies within about 1.65 standard deviations of the mean (more precisely, 1.645), as follows:
lwr = yhat[181:length(yhat)] - 1.65*sd.noise
upr = yhat[181:length(yhat)] + 1.65*sd.noise
pred <- yhat[181:length(yhat)]
head(cbind(lwr,upr,pred))
## lwr upr pred
## [1,] 65.11959 75.53269 70.32614
## [2,] 66.95929 77.37238 72.16583
## [3,] 62.51743 72.93053 67.72398
## [4,] 60.57222 70.98532 65.77877
## [5,] 61.26118 71.67427 66.46772
## [6,] 74.00968 84.42278 79.21623
Here we see the first few intervals, one for each predicted life expectancy value.
Let's do the same in a non-parametric way, using the empirical quantiles of the noise (we get almost the same result):
lwr = yhat[181:length(yhat)] + quantile(noise,0.05, na.rm=T)
upr = yhat[181:length(yhat)] + quantile(noise,0.95, na.rm=T)
head(cbind(lwr,upr,pred))
## lwr upr pred
## [1,] 65.01702 75.71331 70.32614
## [2,] 66.85672 77.55300 72.16583
## [3,] 62.41486 73.11115 67.72398
## [4,] 60.46965 71.16594 65.77877
## [5,] 61.15861 71.85489 66.46772
## [6,] 73.90711 84.60340 79.21623
Now we can check for anomalies (outliers), first with the conventional boxplot fences and then with more extreme ones. Any points flagged by the second check are extreme outliers.
lwr = yhat[181:length(yhat)] + boxplot.stats(noise, coef=1.5)$stats[1]
upr = yhat[181:length(yhat)] + boxplot.stats(noise, coef=1.5)$stats[5]
y2 <- y[181:length(y)]
yhat2 <- yhat[181:length(yhat)]
df <- data.frame(y2,yhat2,lwr,upr)
head(df)
## y2 yhat2 lwr upr
## 1 73.9 70.32614 62.85375 77.50882
## 2 73.0 72.16583 64.69345 79.34852
## 3 71.9 67.72398 60.25159 74.90666
## 4 71.6 65.77877 58.30638 72.96145
## 5 71.0 66.46772 58.99534 73.65041
## 6 75.6 79.21623 71.74384 86.39891
df = df %>% mutate(out=factor(if_else(y2<lwr | y2>upr,1,0)))
mean(df$out==1)
## [1] 0.05415162
As we can see, only about 5.4% of the real values fall outside the intervals.
Now, this is even more robust:
lwr = yhat[181:length(yhat)] + boxplot.stats(noise, coef=3)$stats[1]
upr = yhat[181:length(yhat)] + boxplot.stats(noise, coef=3)$stats[5]
y2 <- y[181:length(y)]
yhat2 <- yhat[181:length(yhat)]
df <- data.frame(y2,yhat2,lwr,upr)
df = df %>% mutate(out=factor(if_else(y2<lwr | y2>upr,1,0)))
mean(df$out==1)
## [1] 0.01083032
Now only about 1% of the real values fall outside, so this is a very good interval.
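As a side note, the fences used above follow Tukey's boxplot rule. A minimal sketch (Python, with made-up residuals, and using the fences themselves rather than the whisker endpoints that R's boxplot.stats reports):

```python
import statistics

# Tukey's rule: a value is an outlier if it falls more than coef * IQR
# beyond the quartiles. coef = 1.5 is the conventional fence; coef = 3
# flags only extreme outliers, so its fences are always wider.
def tukey_fences(values, coef):
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - coef * iqr, q3 + coef * iqr

noise = [-2.1, -1.4, -0.6, -0.2, 0.0, 0.3, 0.7, 1.1, 1.8, 2.5]  # hypothetical residuals
lo15, hi15 = tukey_fences(noise, 1.5)
lo30, hi30 = tukey_fences(noise, 3.0)
print((lo15, hi15))
print((lo30, hi30))  # strictly wider than the coef = 1.5 fences
```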